"Perception, then, emerges as that relatively primitive,
partly autonomous, institutionalized, ratiomorphic subsystem of cognition
which achieves prompt and richly detailed orientation habitually concerning 
the vitally relevant, mostly distal aspects of the environment on the basis 
of mutually vicarious, relatively restricted and stereotyped, insufficient 
evidence in uncertainty-geared interaction and compromise, seemingly 
following the highest probability for smallness of error at the expense of 
the highest frequency of precision." ----- From "Perception and the 
Representative Design of Psychological Experiments," by Egon Brunswik. 


"That's a simplification. Perception is standing on the side- 
walk, watching all the girls go by." ----- From "The New Yorker", 
December 19, 1959.


PREFACE


It is only after much hesitation that the writer has reconciled himself
to the addition of the term "neurodynamics" to the list of such recent
linguistic artifacts as "cybernetics", "bionics", "autonomics", "biomimesis",
"synnoetics", "intelectronics", and "robotics". It is hoped that by selecting
a term which more clearly delimits our realm of interest and indicates its
relationship to traditional academic disciplines, the underlying motivation of
the perceptron program may be more successfully communicated. The term
"perceptron", originally intended as a generic name for a variety of theoretical
nerve nets, has an unfortunate tendency to suggest a specific piece of hardware,
and it is only with difficulty that its well-meaning popularizers can be persuaded
to suppress their natural urge to capitalize the initial "P". On being asked,
"How is Perceptron performing today?" I am often tempted to respond, "Very
well, thank you, and how are Neutron and Electron behaving?"


That the aims and methods of perceptron research are in need of 
clarification is apparent from the extent of the controversy within the scientific 
community since 1957, concerning the value of the perceptron concept. There 
seem to have been at least three main reasons for negative reactions to the 
program. First, was the admitted lack of mathematical rigor in preliminary re- 
ports. Second, was the handling of the first public announcement of the program 
in 1958 by the popular press, which fell to the task with all of the exuberance and 
sense of discretion of a pack of happy bloodhounds. Such headlines as
"Frankenstein Monster Designed by Navy Robot That Thinks" (Tulsa, Oklahoma Times)
were hardly designed to inspire scientific confidence. Third, and perhaps most
significant, there has been a failure to comprehend the difference in motivation
between the perceptron program and the various engineering projects concerned
with automatic pattern recognition, "artificial intelligence", and advanced computers.


For this writer, the perceptron program is not primarily concerned with the
invention of devices for "artificial intelligence", but rather with investigating the
physical structures and neurodynamic principles which underlie "natural
intelligence". A perceptron is first and foremost a brain model, not an
invention for pattern recognition. As a brain model, its utility is in enabling us to
determine the physical conditions for the emergence of various psychological
properties. It is by no means a "complete" model, and we are fully aware of
the simplifications which have been made from biological systems; but it is,
at least, an analyzable model. The results of this approach have already been
substantial; a number of fundamental principles have been established, which
are presented in this report, and these principles may be freely applied,
wherever they prove useful, by inventors of pattern recognition machines and
artificial intelligence systems.


The purpose of this report is to set forth the principles, motivation, 
and accomplishments of perceptron theory in their entirety, and to provide a 
self-sufficient text for those who are interested in a serious study of neuro- 
dynamics. The writer is convinced that this is as definitive a treatment as can 
reasonably be accomplished in a volume of manageable size. Since this volume
attempts to present a consistent theoretical position, however, the student 
would be well advised to round out his reading with several of the alternative 
approaches referenced in Part I. Within the last year, a number of comprehensive
reviews of the literature have appeared, which provide convenient
jumping-off points for such a study. *

* See, for example, Minsky's article, "Steps Toward Artificial Intelligence",
Proc. I.R.E., 49, January, 1961, for an entertaining statement of the views of
the loyal opposition, which includes an excellent bibliography.


The work reported here has been performed jointly at the Cornell
Aeronautical Laboratory in Buffalo and at Cornell University in Ithaca. Both
programs have been under the support of the Information Systems Branch of the
Office of Naval Research -- the Buffalo program since July, 1957, and the Ithaca
program since September, 1959. A number of other agencies have contributed
to particular aspects of the program. The Rome Air Development Center has 
assisted in the development of the Mark I perceptron, and we are indebted to 
the Atomic Energy Commission for making the facilities of the NYU computing
center available to us.


A great many individuals have participated in this work. R. D. Joseph 
and H. D. Block, in particular, have contributed ideas, suggestions, and 
criticisms to an extent which should entitle them to co-authorship of several 
chapters of this volume. I am especially indebted to both of them for their 
heroic performance in proofreading the mathematical exposition presented here,
a task which has occupied many weeks of their time, and which has saved me from
committing many a mathematical felony. Carl Kesler, Trevor Barker, David
Feign, and Louise Hay have rendered invaluable assistance in programming the
various digital computers employed on the project, while the engineering work
on the Mark I was carried out primarily by Charles Wightman and Francis Martin
at C.A.L. The experimental program with the Mark I was carried out by John
Hay. In addition to all of those who have contributed directly to the research
activities, the writer is indebted to Professors Mark Kac, Barkley Rosser, and
other members of the Cornell faculty for their administrative support and
encouragement, and to Alexander Stieber, W. S. Holmes, and the administrative staffs
of the Cornell Aeronautical Laboratory and the Office of Naval Research whose
confidence and support have carried the program successfully through its
infancy.


Frank Rosenblatt 
15 March 1961


1. INTRODUCTION


The theory to be presented here is concerned with a class of
"brain models" called perceptrons. By "brain model" we shall mean
any theoretical system which attempts to explain the psychological function- 
ing of a brain in terms of known laws of physics and mathematics, and known 
facts of neuroanatomy and physiology. A brain model may actually be cons- 
tructed, in physical form, as an aid to determining its logical potentialities 
and performance; this, however, is not an essential feature of the model- 
approach. The essence of a theoretical model is that it is a system with 
known properties, readily amenable to analysis, which is hypothesized to 
embody the essential features of a system with unknown or ambiguous 
properties --in the present case, the biological brain. Brain models of 
different types have been advanced by philosophers, psychologists, biologists, 
and mathematicians, as well as electrical engineers (c.f., Refs. 17, 31, 33, 
54, 59, 61, 74, 91, 105, 109). The perceptron is a relative newcomer to this 
field, having first been described by this writer in 1957 (Ref. 78). Perceptrons 
are of interest because their study appears to throw light upon the biophysics of 
cognitive systems: they illustrate, in rudimentary form, some of the processes 
by which organisms, or other suitably organized entities, may come to
possess "knowledge" of the physical world in which they exist, and by which 
the knowledge that they possess can be represented or reported when occasion 
demands. The theory of the perceptron shows how such knowledge depends 
upon the organization of the environment, as well as on the perceiving
system.


At the time that the first perceptron model was proposed, the
writer was primarily concerned with the problem of memory storage in 
biological systems, and particularly with finding a mechanism which would 
account for the "distributed memory" and "equipotentiality" phenomena found 
by Lashley and others (Refs. 48, 49, 95). It soon became clear that the 
problem of memory mechanisms could not be divorced from a consideration 
of what it is that is remembered, and as a consequence the perceptron became 
a model of a more general cognitive system, concerned with both memory and
perception.


A perceptron consists of a set of signal generating units (or
"neurons") connected together to form a network. Each of these units, upon
receiving a suitable input signal (either from other units in the network or
from the environment) responds by generating an output signal, which may
be transmitted, through connections, to a selected set of receiving units. Each
perceptron includes a sensory input (i.e., a set of units capable of responding
to signals emanating from the environment) and one or more output units, which
generate signals which can be directly observed by an experimenter, or by an
automatic control mechanism. The logical properties of a perceptron are
defined by:


1. Its topological organization (i.e., the connections among
the signal units);

2. A set of signal propagation functions, or rules governing
the generation and transmission of signals;

3. A set of memory functions or rules for modification of
the network properties as a consequence of activity.
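
The three defining components listed above can be illustrated by a minimal
sketch. Everything specific in it is an assumption made for illustration and
is not taken from the text: the class name Network, the threshold rule used
for signal propagation, and the additive rule used for memory modification
are all hypothetical stand-ins for the functions defined in later chapters.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Connection = Tuple[int, int]        # (source unit, receiving unit)
Weights = Dict[Connection, float]   # modifiable property of each connection


@dataclass
class Network:
    n_units: int
    weights: Weights                # 1. topology: which units are connected
    threshold: float = 1.0          # assumed parameter, not from the text

    def propagate(self, active: List[int]) -> List[int]:
        """2. Signal propagation (assumed rule): a unit emits a signal when
        the summed weights of its active inputs reach the threshold."""
        totals = [0.0] * self.n_units
        for (src, dst), w in self.weights.items():
            if src in active:
                totals[dst] += w
        return [u for u in range(self.n_units) if totals[u] >= self.threshold]

    def reinforce(self, active: List[int], delta: float) -> None:
        """3. Memory function (assumed rule): add delta to every connection
        that originates at a currently active unit."""
        for (src, dst) in self.weights:
            if src in active:
                self.weights[(src, dst)] += delta


net = Network(n_units=3, weights={(0, 2): 0.6, (1, 2): 0.6})
print(net.propagate([0]))      # [] : one input alone is below threshold
print(net.propagate([0, 1]))   # [2]
net.reinforce([0], 0.5)        # activity modifies the network properties
print(net.propagate([0]))      # [2] : behavior now depends on past activity
```

The point of the sketch is only that the same topology can yield different
responses after its memory state has been modified; the particular rules are
placeholders.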


A perceptron is never studied in isolation, but always as part of a
closed experimental system, which includes the perceptron itself, a defined 
environment, and a control mechanism or experimenter capable of applying 
well-defined rules for the modification, or "reinforcement" of the perceptron's 
memory state. In most analyses, we are not concerned with a single percep- 
tron, but rather with the properties of a class of perceptrons, whose topolo- 
gical organizations come from some statistical distribution. A perceptron, 
as distinct from some other types of brain models, or "nerve nets", is usually 
characterized by the great freedom which is allowed in establishing its 
connections, and the reliance which is placed upon acquired biases, rather
than built-in logical algorithms, as determinants of its behavior.


Because of a common heritage in the philosophy, psychology, 
physiology, and technology of the last few centuries, there are bound to be
similarities between the points of view and the basic assumptions of the 
theory presented here, and of other theories. The writer makes no claim to 
uniqueness in this respect. In particular, the neuron model employed is a 
direct descendant of that originally proposed by McCulloch and Pitts; the 
basic philosophical approach has been heavily influenced by the theories of 
Hebb and Hayek and the experimental findings of Lashley; moreover, the 
writer's predilection for a probabilistic approach is shared with such theorists
as Ashby, Uttley, Minsky, MacKay, and von Neumann, among others.


This volume is divided into four main sections. Part I,
commencing with this introduction, attempts to review the background,
basic sources of data, concepts, and methodology to be employed in the
study of perceptrons. In Chapter 2, a brief review of the main alternative
approaches to the development of brain models is presented. Chapter 3
considers the physiological and psychological criteria for a suitable model, 
and attempts to evaluate the empirical evidence which is available on several
important issues. Sufficient references to the literature are included through- 
out these chapters so that the reader who requires additional background in 
any of the areas discussed can use this as a guide for further reading. Part I 
concludes with Chapter 4, in which basic definitions and some of the notation 
to be used in later sections are presented. Parts II and III are devoted to a 
summary of the established theoretical results obtained to date. In these 
sections, the strategy will be to present a number of models of increasing 
complexity and sophistication, with theorems and analytic results on each 
model to indicate its capabilities and deficiencies. Wherever possible, 
established mathematical results will be presented first, followed by empirical 
evidence from simulation and hardware experiments. Part II (Chapters 5 
through 14) deals with the theory of three-layer series-coupled perceptrons, 
on which most work has been done to date. These systems are called "mini- 
mal perceptrons". Part III (Chapters 15 through 20) deals with the theory of 
multi-layer and cross-coupled perceptrons, where a great deal still remains 
to be done, but where the most provocative results have begun to emerge. 
Part IV is concerned with more speculative models and problems for future 
analysis. Of necessity, the final chapters become increasingly heuristic in 
character, as the theory of perceptrons is not yet complete, and new
possibilities are continually coming to light.


Part I (except for the chapter on definitions) is entirely
non-mathematical. In Part II, and most of the remainder of the text, familiarity
with the elements of modern algebra and probability theory is assumed, and
should be sufficient for most of the material. In several proofs in Part II,
and to a greater extent in Part III, analytic methods are employed, assuming
knowledge of the calculus and differential equations; an elementary acquaintance 
with differential geometry would also be useful. Symbolic logic is not required 
here, but the student will find it necessary for reading much of the ancillary 
literature in the field. 


Several appendices are included which may prove helpful for 
cross-referencing equations, definitions, and experimental designs which 
are described in different chapters. Appendix A is a list of all symbols used 
in a standard manner throughout the volume. Appendix B is a consolidated
list of theorems and corollaries. Appendix C lists the principal equations
used in the analysis of performance, and basic quantitative functions. Appendix
D contains a summary of the experiments used for testing and comparing
different perceptrons. These experiments are referred to by number
throughout the text, and are described in detail as they are first introduced.


2. HISTORICAL REVIEW OF ALTERNATIVE APPROACHES


2.1 Approaches to the Brain Model Problem 


There are at least two basic points, which are fundamental to a 
theory of brain functioning, on which most of the present-day theorists seem 
to be in agreement. First is the assumption that the essential properties of 
the brain are the topology and the dynamics of impulse-propagation in a net- 
work of nerve cells, or neurons. This has been contested by a few theorists 
who hold that the individual cells and their properties are less important than 
the bulk properties and electrical currents in the cortical medium as a whole 
(c.f. Köhler, Ref. 45). The "neuron doctrine", however, has now been
accepted with sufficient universality that it need not be considered as an 
issue in this report (Bullock, Ref.i)}). It will be assumed that the essential 
features of the brain can be derived in principle from a knowledge of the 
connections and states of the neurons which comprise it. Secondly, there is 
general agreement that the information-handling capabilities of biological 
networks do not depend upon any specifically vitalistic powers which could 
not be duplicated by man-made devices. This also has occasionally been 
questioned, even today, by such neurologists as Eccles (Ref. 18) who 
advocate a dualistic approach in which the mind interacts with the body. 
Nonetheless, all currently known properties of a nerve cell can be simulated 
electronically with readily available devices. It is significant that the 
individual elements, or cells, of a nerve network have never been
demonstrated to possess any specifically psychological functions, such as "memory",
"awareness", or "intelligence". Such properties, therefore, presumably
reside in the organization and functioning of the network as a whole, rather
than in its elementary parts. In order to understand how the brain works, it
thus becomes necessary to investigate the consequences of combining simple
neural elements in topological organizations analogous to that of the brain.
We are therefore interested in the general class of such networks, which
includes the brain as a special case.


While there is substantial agreement up to this point, theorists 
are divided on the question of how closely the brain's methods of storage, 
recall, and data processing resemble those practised in engineering today. 
On the one hand, there is the view that the brain operates by built-in 
algorithmic methods analogous to those employed in digital computers, while 
on the other hand, there is the view that the brain operates by non-algorithmic 
methods, bearing little resemblance to the familiar rules of logic and
mathematics which are built into digital devices (c.f. von Neumann, Ref. 105). The
advocates of the second position (this writer included) maintain that new funda- 
mental principles must be discovered before it will be possible to formulate an 
adequate theory of brain mechanisms. It is suggested that probabilistic and 
adaptive mechanisms are particularly important here. This does not mean 
that the actual biological nervous system is strictly one type of device or 
the other; the issue concerns the matter of emphasis, as to whether the brain 
is primarily a more or less conventional computing mechanism, in which 
statistical or adaptive processes play an incidental and non-essential role, 
or whether the brain is so dependent upon such processes that a model which 
fails to take them into account will find itself unable to account for
psychological performance.


These two points of view are associated with two basically
different procedures for studying the mechanisms of the brain and for the 
development of brain models. The first procedure will be called the mono- 
typic model approach; it amounts to the detailed logical design of a special- 
purpose computer to calculate some predetermined "psychological function" 
such as the result of a recognition algorithm, or a stimulus transformation, 
which is postulated as a plausible function for a nerve net to calculate. The
physical properties of this computer are then compared with those of the 
brain, in the hopes of finding resemblances. The second procedure will be 
called the genotypic model approach. Instead of beginning with a detailed 
description of functional requirements and designing a specific physical 
system to satisfy them, this approach begins with a set of rules for genera- 
ting a class of physical systems, and then attempts to analyse their perform- 
ance under characteristic experimental conditions to determine their common 
functional properties. The results of such experiments are then compared 
with similar observations on biological systems, in the hopes of finding a 
behavioral correspondence. It is the purpose of this chapter to review the 
historical development and current status of these two alternative
"philosophies of approach" to the brain model problem.


2.2 Monotypic Models 


In the monotypic model approach, the theorist generally begins 
by defining as accurately as possible the performance required from his 
model. For example, he may specify a data processing operation, an
input-output or stimulus-response function, or a remembering and
regenerating operation. In one typical model, the system is required to
normalize the size and position of a visual image, and to compare functions
of this normalized image with certain stored quantities required for
identification (Ref. 71). Given a description of the required performance in
sufficiently precise terms, the theorist then proceeds to design a computing 
machine or control system embodying the required function, generally limiting 
himself to the use of a set of modular switching devices which are analogous 
to biological neurons in their properties. It is this last constraint which 
distinguishes the nerve net theorist from any other designer of special 
purpose computers confronted with the same problem. It is hoped that a 
network which consists of neuron-like elements, and is capable of computing 
the required functions, will be found to resemble a biological nerve net in its
organization and the computational principles employed.


While the simulation of animals, saints, and chessplayers by 
animated machines and clockwork devices goes back many centuries, the 
idea of constructing such devices out of simple logical elements with neuron- 
like properties is a relatively recent one, and received its first impetus from 
two sources: First, Turing's paper "On Computable Numbers", in 1936, and 
the subsequent development of stored-program digital computers by von 
Neumann and others during the 1940's (Refs. 12, 100) gave rise to an
impressive family of "universal automata", capable of executing programs
which would enable them to perform any computation whatsoever with only
the simplest of logical devices being employed as "building blocks". Second,
the Chicago group of mathematical biophysicists which grew up about
Rashevsky after the publication of his "Mathematical Biophysics" in 1938
(Ref. 73) began to investigate the manner in which "nerve nets" consisting of
formalized neurons and connections might be made to perform psychological 
functions. Householder, Landahl, Pitts, and others made notable contributions
to this effort during the late 1930's and early 1940's (Refs. 35, 69, 70).


In 1943, the doctrine and many of the fundamental theorems of this 
approach to nerve net theory were first stated in explicit form by McCulloch 
and Pitts, in their well-known paper on "A Logical Calculus of the Ideas 
Immanent in Nervous Activity". The fundamental thesis of the McCulloch- 
Pitts theory is that all psychological phenomena can be analyzed and understood 
in terms of activity in a network of two-state (all-or-nothing) logical devices. 
The specification of such a network and its propositional logic would, in the 
words of the writers, "contribute all that could be achieved" in psychology, 
"even if the analysis were pushed to ultimate psychic units or 'psychons', 
for a psychon can be no less than the activity of a single neuron... The
'all-or-none' law of these activities, and the conformity of their relations to
those of the logic of propositions, insure that the relations of psychons are
those of the two-valued logic of propositions." (Ref. 57). Despite the
apparent adherence to an outdated atomistic psychological approach, there 
is an important contribution in the recognition that the proposed axiomatic 
representation of neural elements and their properties permits strict logical 
analysis of arbitrarily complicated networks of such elements, and that 
such networks are capable of representing any logical proposition whatever. 
As von Neumann states in a summary of the McCulloch-Pitts model
(Ref. 103), "The 'functioning' of such a network may be defined by singling
out some of the inputs of the entire system and some of its outputs, and
then describing what original stimuli on the former are to cause what ultimate
stimuli on the latter... McCulloch and Pitts' important result is that any
functioning in this sense which can be defined at all logically, strictly, and
unambiguously in a finite number of words can also be realized by such a
formal neural network."


A great variety of subsequent models have made use of this 
axiomatic representation, which we now refer to as the "McCulloch-Pitts
neuron". As stated in the original paper (Ref. 57), the basic assumptions in
this representation are:


"1. The activity of the neuron is an 'all-or-none'
process.

2. A certain fixed number of synapses must be
excited within the period of latent addition in
order to excite a neuron at any time, and this
number is independent of previous activity and
position on the neuron.

3. The only significant delay within the nervous
system is synaptic delay.

4. The activity of any inhibitory synapse absolutely
prevents excitation of the neuron at that time.

5. The structure of the net does not change with time."
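
The five assumptions quoted above can be sketched as a small function. This
is an illustrative reading, not code from any of the cited papers: the
function name mp_neuron, its argument names, and the three example cells are
hypothetical. Time is taken in discrete steps of one synaptic delay, so the
value returned is the cell's output one step after the inputs shown.

```python
from typing import Sequence


def mp_neuron(excitatory: Sequence[int],
              inhibitory: Sequence[int],
              threshold: int) -> int:
    """Return 1 (fires) or 0 (silent): the 'all-or-none' output (postulate 1)."""
    if any(inhibitory):
        # Postulate 4: any active inhibitory synapse absolutely prevents firing.
        return 0
    # Postulate 2: a fixed number of excited synapses is required, regardless
    # of which synapses they are or of any previous activity.
    return 1 if sum(excitatory) >= threshold else 0


# Postulate 5: the thresholds and connections below never change with time.
def AND(a: int, b: int) -> int:
    return mp_neuron([a, b], [], threshold=2)


def OR(a: int, b: int) -> int:
    return mp_neuron([a, b], [], threshold=1)


def AND_NOT(a: int, b: int) -> int:
    # a excites the cell, b inhibits it.
    return mp_neuron([a], [b], threshold=1)


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print([AND(a, b) for a, b in pairs])      # [0, 0, 0, 1]
print([OR(a, b) for a, b in pairs])       # [0, 1, 1, 1]
print([AND_NOT(a, b) for a, b in pairs])  # [0, 0, 1, 0]
```

Cells of this kind, wired together, are the building blocks assumed in the
representation results discussed in this section.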


These postulates are such as to rule out memory except in the form of
modifications of perpetual activity or circulating loops of impulses in the
network. Any non-volatile memory, such that the functioning of the network
at a given time depends upon previous activity even though a period of total
inactivity has intervened, is impossible in a McCulloch-Pitts network.
However, a McCulloch-Pitts network can always be constructed which will
embody whatever input-output relations might be realized by a system with
an arbitrary memory mechanism, provided activity is allowed to persist in
the network.
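
The restriction just described, memory only as persisting activity, can be
illustrated with a single cell whose output is fed back to its own input.
This is a sketch under assumed conventions, not a construction from the
cited papers: the function name run_loop and the use of an inhibitory input
to impose a moment of inactivity are hypothetical choices.

```python
from typing import List, Sequence


def run_loop(set_signal: Sequence[int], inhibit: Sequence[int]) -> List[int]:
    """Simulate one threshold-1 cell with a self-exciting feedback loop.

    At each step the cell receives an external 'set' input, its own previous
    output (delayed by one synaptic delay), and an inhibitory input.
    Returns the cell's output after each step.
    """
    state = 0
    history = []
    for external, stop in zip(set_signal, inhibit):
        if stop:
            state = 0                      # absolute inhibition
        else:
            state = 1 if external + state >= 1 else 0
        history.append(state)
    return history


# A single pulse at step 1 is "remembered" only as circulating activity;
# once inhibition silences the loop at step 3, the trace is gone for good.
print(run_loop([0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]))  # [0, 1, 1, 0, 0, 0]
```

Nothing in the structure of the cell records that the pulse occurred, which
is the sense in which such a network has no non-volatile memory.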


Later writers, notably Kleene (Ref. 43) have considered in 
more detail the kinds of events which can be represented by networks of 
McCulloch-Pitts neurons. The only important limitation is that events 
whose definition depends upon the choice of a temporal origin point, or 
events which extend infinitely into the past, may not be representable by 
outputs from finite networks. Any event which can be described as one of 
a definite set of possible input sequences over a finite period of time can be 
represented. In particular, any events which might conceivably be recognized 
by a biological system can be represented by outputs of networks of
McCulloch-Pitts neurons.


In later papers by Pitts and McCulloch (Ref. 71) and by 
Culbertson (Refs. 16, 17) specific automata designed to perform actual 
"psychological" functions such as pattern recognition, have been described. 
Culbertson, in particular, has carried out such designs in explicit detail for
a large number of interesting problems. The approach which he advocates
is expounded in his 1950 work on "Consciousness and Behavior" as
follows:


"Neuroanatomy and neurophysiology have not yet developed 
far enough to tell us the detailed interconnections holding 
within human or animal nets...Consequently, ... we cannot 
start with specified nerve nets and then in a straightforward 
way determine their properties. Instead, it is the reverse 
problem which always occurs in dealing with organic behavior. 
We are given at best the vaguely defined properties of an 
unknown net and from these must determine what the structure 
of that net might possibly be. In other words, we know, at
least in a rough way, what the net does (as this appears in 
the behavior of the animal or man) and from this information 
we have to figure out what structure the net must have...Our 
investigation passes through two stages. In the first stage-- 
the behavioristic inquiry--we ignore the inner constituents, 
i.e., the nervous system and its activity, and concentrate 
our attention instead on the observable relations between the 
stimuli affecting the organism and the responses to which 
these stimuli give rise...This makes the second stage--the 
functional inquiry--possible. Here, as Northrop says, we 
concentrate our attention on the inner (throughput) consti- 
tuents of the system and point out the ways in which the 
receptor cells, central cells, and effector cells could be 
interconnected so that the input and output relations...would
be those discovered in stage 1." 


While such a program can hardly be criticized on logical grounds, 
it appears pragmatically to have fallen short of the proposed goals. Starting 
rather suddenly, with the development of automata theory in the late 1930's, 
the ready applicability of symbolic logic brought this approach to early 
mathematical sophistication. After the first flood of proposed models, 
further progress has been disappointingly trivial, and returns seem to be 
diminishing rapidly. The promised biological "explanations" have been 
particularly lacking. In this writer's opinion, there are at least five main
reasons for this:


(1) There is a lack of sufficiently well defined psychological
functions as a starting point. The approach requires
essentially full knowledge of input-output relations for the
behavior of an organism, and such knowledge is not
available for any biological species.


(2) Constructed solutions generally show poor correspondence
to known conditions of neuroanatomy and neuroeconomy;
the numbers of neurons required often exceed those in
biological nervous systems, and the logical organization
generally requires a precision of connections which
appears to be absent in the brain. In some cases, a
single misconnection would be sufficient to make the
system inoperable.


(3) The models fail to yield general laws of organization.
A monotypic model is in general overdetermined,
corresponding at best to a biological phenotype,
rather than a species as a whole; its specification in
the form of a detailed "wiring diagram" frequently
misses essentials in a plethora of detail. Unique
solutions for the proposed functions are generally
lacking and an enormous variety of models can be
generated which appear to solve the same problem
equally well. Therefore, unless the system is actually
tested against its biological counterpart, nothing is
gained by a detailed construction of the model except a
further confirmation of an existence theorem which is
already well established.


(4) The models lack predictive value. Once a particular
model has been proposed, further analysis can reveal
little that is not included in the functional description
with which we began.


(5) The models are not biologically testable in detail.
Specific connections cannot be traced with sufficient
precision in nervous tissue to say whether or not a
particular wiring diagram is exactly realized.
Consequently, the models are fated to remain purely
speculative unless histological techniques are improved to
a highly improbable degree.


In the foregoing, we have concentrated on the line of models 
which have attempted to represent the brain as a symbolic logic calculator, 
in which events of the outside world are represented by the firing or non- 
firing of particular neurons. It is in these models that rigorous mathematical 
treatment has been most successfully achieved. Not all monotypic models 
are of this variety, however. Field theorists such as Köhler have taken 
exception to the idea that psychological phenomena can be represented in 
this fashion. Köhler, arguing for an isomorphic representation of perceptual 
phenomena, asks (Ref. 46): "How can a cortical process such as that of a 
square give rise to an apparition with certain structural characteristics, if 
these characteristics are not present in the process itself? According to 
Dr. McCulloch, this is actually the case. But if we follow the example of 
physics, we shall hesitate to accept his view. In physics, the structural 
characteristics of a state of affairs are given by the structural properties 
of the factors which determine that state of affairs... Situations in physics 
which depend upon the spatial distribution of given conditions never have 
more, and more specific, structural characteristics than are contained in 
the conditions". While Köhler's own model is not generally considered 
plausible today, his criticism is a significant one, and a number of theorists, 
such as Lashley (Ref. 50), MacKay (Refs. 55, 56), and Green (Ref. 28), have 
been concerned with possible forms of representation of perceptual 
information which would preserve the intrinsic structural features of the 
perceived event rather than merely assigning an arbitrary symbol to it. 


The main line of monotypic models, although failing to provide 
a satisfactory brain model, has left us a number of important analytic tools 
and concepts, including the McCulloch-Pitts neuron, and the theorems 
concerning the existence of networks representing arbitrary functions. For 
the actual design of plausible organizations, however, the genotypic approach 


appears to hold more promise. 


2.3 Genotypic Models 


In the monotypic approach, the properties of the components, 
or neurons, which comprise the networks are fully specified axiomatically, 
and the topology of the network is fully specified as well. In the genotypic 
approach, the properties of the components may be fully specified, but the 
organization of the network is specified only in part, by constraints and 
probability distributions which generate a class of systems rather than a 
specific design. The genotypic approach, then, is concerned with the 
properties of systems which conform to designated laws of organization, 
rather than with the logical function realized by a particular system. 


This difference in approach leads to important differences in 
the types of models which are generated, and the kinds of things which can 
be done with them. In the case of monotypic models, for example, the 
propositional calculus is applicable and probability theory is poorly suited 
to the analysis of performance, since a single fully deterministic system is 
under consideration which either does or does not satisfy the required 
functional equations. In dealing with genotypic models, on the other hand, 
symbolic logic is apt to prove cumbersome or totally inapplicable (even 
though, in principle, any particular system which is generated might be 
expressed by a set of logical propositions). In the analysis of such models, 
the chief interest is in the properties of the class of systems which is 
generated by particular rules of organization, and these properties are 
best described statistically. Probability theory therefore plays a promi- 
nent part in this approach. A second major difference is in the method of 
determining functional characteristics of the models. In the monotypic 
approach, the functional properties are generally postulated as a starting 
point. In the genotypic approach, they are the end-objective of analysis, 
and the physical system itself (or the statistical properties of the class of 


systems) constitutes the starting point. This means that psychological 


functions need not be determined in full detail before setting out to construct 


a model, and, indeed, it is hoped that such models may help in answering 


open psychological questions. 
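
The genotypic procedure can be illustrated by a minimal sketch in Python. A single rule of organization is used to generate many networks, and a statistic of the resulting class is reported, rather than the logical function of any one network. The rule chosen here (each unit receives a fixed number of randomly chosen connections, each excitatory or inhibitory with a fixed probability, and becomes active if its net input reaches a threshold), the function names, and all numerical values are assumptions made for illustration only; they are not taken from any particular model discussed in the text.

```python
import random
import statistics

def sample_network(n_inputs, n_units, fan_in, p_excitatory, rng):
    """Draw one network from the class: each unit gets `fan_in` distinct random
    origins, each connection excitatory (+1) with probability p_excitatory,
    otherwise inhibitory (-1)."""
    network = []
    for _ in range(n_units):
        origins = rng.sample(range(n_inputs), fan_in)
        signs = [1 if rng.random() < p_excitatory else -1 for _ in origins]
        network.append(list(zip(origins, signs)))
    return network

def active_fraction(network, stimulus, threshold):
    """Fraction of units whose summed input reaches the threshold."""
    active = 0
    for unit in network:
        total = sum(sign * stimulus[origin] for origin, sign in unit)
        if total >= threshold:
            active += 1
    return active / len(network)

def class_statistics(n_networks=200, seed=0):
    """Mean and standard deviation of the active fraction over the class."""
    rng = random.Random(seed)
    stimulus = [1 if i < 20 else 0 for i in range(50)]  # fixed test stimulus
    fractions = [
        active_fraction(sample_network(50, 100, 10, 0.7, rng), stimulus, 2)
        for _ in range(n_networks)
    ]
    return statistics.mean(fractions), statistics.stdev(fractions)

mean_fraction, spread = class_statistics()
print(f"mean active fraction = {mean_fraction:.3f}, std = {spread:.3f}")
```

The quantity of interest is the pair (mean, spread) for the class as a whole; any individual network drawn from it is only one sample.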

While the monotypic approach arose rather suddenly with the 
advent of modern computers and control system theory, and rapidly advanced 
to a high level of mathematical sophistication, the genotypic approach has 
been much more gradual in its development, and has not yet developed all 
of the mathematical tools required to deal adequately with its problems. 

The genotypic models have been influenced less by the engineering sciences, 
and more by physiology and neuroanatomy. The descriptive anatomy of the 
nineteenth century laid the groundwork for modern studies of localization of 
function in the brain, and neurologists such as John Hughlings Jackson noted 
the apparent plasticity of the system -- the ability of neighboring regions to 
take over the function of damaged areas. Pavlov and others speculated about 
possible mechanisms for adaptive modification of the central nervous system 
in the early part of this century, and various hypotheses for the deposition of 
"memory traces" were of interest to psychologists and physiologists alike. 
The doctrine of equipotentiality, propounded by Lashley (Ref. 49), went even 
further in claiming complete interchangeability of most parts of the cerebral 
cortex, and evidence for "distributed memory" which suggested that "traces" 
must be more or less uniformly dispersed throughout the cortical tissue 
began to accumulate. All of this neurological evidence engendered a picture 
of the brain as a relatively undifferentiated structure, capable of undergoing 
radical reorganization by means of unspecified adaptive mechanisms, and 
showing only gross anatomical equivalence from one individual to another. 
While recent work on localization (Refs. 51, 65, 66, 94, 108) has shown 
some surprisingly precise mapping of functions, modern morphological 
investigations (Refs. 8, 52, 93) have borne out the apparently statistical 
organization of the "fine structure" of neurons and their interconnections. 


It now seems reasonable to suppose that while there are many constraints 
on the organization of neurons in the brain, which are undoubtedly essential 
to the system's functioning, these constraints take the form of prohibitions, 
biases, and directional preferences, rather than a specific blueprint which 
must be followed to the last detail. In other words, there are enormous 
numbers of functionally equivalent systems, all obeying the same rules of 
organization, and all equally likely to be generated by the genetic mechanisms 
of a particular species. 


While the neurologists mentioned above had a great deal to say 
about the observed and hypothetical organization of the brain, they were not 
concerned with the construction of models in the sense of detailed theoretical 
systems from which precise deductions could be made. Psychologists and 
philosophers, more willing to indulge in speculation, were the first to attempt 
detailed conjectures on the maturation of psychological functions in systems 
which might justifiably be called "brain models". Hebb (Ref. 33) and Hayek 
(Ref. 32), following the tradition of James Stuart Mill and Helmholtz, have 
attempted to show how an organism can acquire perceptual capabilities 
through a maturational process. For Hayek, the recognition of the attributes 
of a stimulus is essentially a problem in classification, and his point 
of view has inspired Uttley (Refs. 101, 102) to design a type of classifying- 
automaton which attempts to translate the approach into more rigorous 
mathematical form. Hebb's model is more detailed in its biological 
description, and suggests a process by which neurons which are frequently 
activated together become linked into functional organizations called 
"cell assemblies" and "phase sequences" which, when stimulated, correspond 
to the evocation of an elementary idea or percept. While Hebb's 
work is far more complete in its specification of a "model" than most 
preceding suggestions along this line, it is still too programmatic and too 
loose in its definitions to permit a rigorous testing of hypotheses. It should 
be considered more as a description of what a satisfactory model might 
ultimately look like than as a fully formulated model in its own right. None- 
theless, it comes sufficiently close to a detailed specification so that 
Rochester and associates, using an IBM computer, were able to propose 
enough of the missing detail to put the cell assembly hypothesis to an 
empirical test (Ref. 77). Unfortunately, with a theory so loosely specified, 
the inconclusive results of the IBM experiments carry little weight in 
evaluating Hebb's original system. Milner, in a recent paper (Ref. 58) has 
attempted to update the Hebb theory, and it may be that his model can be 
more readily translated into analyzable form, although this has not yet been 


done. 


It is interesting that one of the first applications of probability 
theory to brain models is due to Landahl, McCulloch, and Pitts, appearing 
in 1943 along with the McCulloch-Pitts symbolic logic model (Ref. 47). In 
this paper, the topology of the network is still assumed to be a strictly 
deterministic, fully known organization, but impulses are assumed to be 
propagated with known frequencies but with uncertainties in their precise 
timing. A theorem is stated which permits the substitution of frequencies 
for symbols in the logical equations of the network, in order to obtain the 
expected frequency with which different cells will respond. This statistical 
treatment is related to the work of von Neumann (Ref. 104) on the proba- 


bility of error in networks with fallible components. 
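
The idea of substituting frequencies for symbols can be illustrated by a short Python sketch. The exact statement and conditions of the theorem are not reproduced here; the sketch assumes that the input events are statistically independent, in which case conjunction corresponds to a product of frequencies, negation to a complement, and disjunction to one minus a product of complements. The logical expression, the function names, and the numerical frequencies are hypothetical, chosen only for illustration.

```python
import random

def expected_frequency(p_a, p_b, p_c):
    """Frequency of (A and B) or (not C), assuming independent inputs:
    AND -> product, NOT -> complement, OR -> 1 - product of complements."""
    p_and = p_a * p_b
    p_not_c = 1.0 - p_c
    return 1.0 - (1.0 - p_and) * (1.0 - p_not_c)

def simulated_frequency(p_a, p_b, p_c, trials=100_000, seed=1):
    """Monte Carlo estimate of how often the logical expression is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.random() < p_a
        b = rng.random() < p_b
        c = rng.random() < p_c
        if (a and b) or (not c):
            hits += 1
    return hits / trials

analytic = expected_frequency(0.5, 0.4, 0.9)
estimate = simulated_frequency(0.5, 0.4, 0.9)
print(f"analytic = {analytic:.4f}, simulated = {estimate:.4f}")
```

The simulated frequency should agree with the analytic value to within sampling error, provided the independence assumption holds.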

The first systematic attempt to develop a family of statistically 
organized networks, and to analyze these in a rigorous fashion by means of 
a genotypic approach seems to have been due to Shimbel and Rapoport, in 
1948 (Ref. 92). Starting with an axiomatic representation of neurons and 
connections, similar to that of McCulloch and Pitts, a network is character- 
ized by probability distributions for thresholds, synaptic types, and origins 
of connections. A general equation is then developed for the probability that 
a neuron at a specified location will fire at a specified time, as a function of 
preceding activity and parameters of the net. This is applied to a number of 
specific classes of networks to determine the possibility of steady-state 
activity, and changes in the firing distribution with time. This work is a 
forerunner of a number of stability studies (e.g., Allanson, Ref. 2) which 


are still of interest. 


The use of a digital computer by Rochester and associates was 
mentioned above in connection with Hebb's model. Simulation of a statistically 
connected network to investigate possible learning capabilities was first 
carried out successfully by Farley and Clark in 1954 (Ref. 10). Although 
mathematical analysis was not attempted in either the Farley-Clark or the 
Rochester models, they illustrate a convenient method of axiomatizing a 
network (by means of a computer program) to a degree which makes the 
investigation of hypotheses possible. While none of these experiments led 
to very sophisticated systems, they are of considerable historical interest, 
and the mechanism for pattern generalization proposed by Clark and Farley 


(Ref. 15) is essentially identical to that found in simple perceptrons. 

Statistical models of various types have been proposed during the 
last decade. In particular, the models of Beurle, Taylor, and Uttley (Refs. 6, 
99, 101, 102) are of interest as attempts to analyze models with a clear 
resemblance to the organization of a primitive nervous system, with receptors, 
associative elements, and output or motor neurons. Moreover, in some of 
these models, environments of sufficient complexity to permit the repre- 
sentation of visual and temporal patterns (albeit of a very primitive type) 
are included in the analysis. Minsky (Ref. 59) has also devised and analyzed 


several models capable of learning responses to simple stimuli. 


A contribution of considerable methodological significance was 
Ashby's "Design for a Brain", in 1952 (Ref. 3). While Ashby's work (despite 
its title) does not specify an actual brain model in our present sense, it 
develops the rationale for an analysis of closed systems which must include 
the environment as well as the responding organism and rules of interaction 
as the object of study. Ashby's fields of variables correspond closely to 
our concept of "experimental systems" which will be defined in Chapter 4. 
In addition to his conceptual contribution, which is concerned with the 
general approach to be used rather than with a specific model, Ashby has 
demonstrated in a number of experiments how statistical mechanisms can 


yield adaptive behavior in an organism. 


While the genotypic approach has found favor among many 


biologists, it is by no means universally accepted. A typical criticism is 
voiced by Sutherland (Ref. 97) in connection with Hebb's system: 


"When Hebb's theory was first put forward, it was hailed 
as showing how it might be possible to account for behavior 
in terms of plausible neurophysiological mechanisms... 
However, a moment's reflection shows that, if he is right, 
what he has really succeeded in doing is to demonstrate 

the utter impossibility of giving detailed neurophysiological 
mechanisms for explaining psychological or behavioral 
findings. According to Hebb the precise circuits used in 
the brain for the classification of a particular shape will 
vary from individual to individual with chance variation 

in nerve connectivity determined by genetic and matura- 
tional factors... Different individuals will achieve the 
same end result in behavior by very different neurological 
circuits... If Hebb's general system is right, it precludes 
the possibility of ever making detailed predictions about 
behavior from a detailed model of the system underlying 
behavior." 


While objections such as this seem to stem from a misunderstanding 
of the possibility of obtaining seemingly deterministic phenomena from a 
statistical substrate (as in statistical mechanics) the above argument is bols- 
tered by many findings which suggest complicated hereditary mechanisms 
for the analysis of stimuli in "instinctive" behavior. The work of Sperry 
and Lettvin has already been cited in connection with the mechanisms for 
precise localization of connections which seem to exist in the brain. Our 
conclusion is that the biological system must employ some mixture of 
specific connection mechanisms and statistically determined structures; 
just how much constraint is present in the genetic constitution of the brain is 


an open question. 

On most of the specific points of criticism raised in connection 
with monotypic models, the genotypic approach seems to fare much better. 
Detailed psychological functions are not required as a starting point. Detailed 
physiological knowledge of the brain would be helpful, but even a rough para- 
metric description enables us to start off in the right direction, and present 
models have a considerable way to go before they have assimilated all of the 


physiological data which are available. 


Since this approach begins with the physical model rather than the 
functions which must be performed, it is easy to guarantee its conformity in 
size and organization to the general characteristics of a biological system. 
Most important is the fact that this approach appears to be yielding results of 
increasing significance and interest, and the models frequently suggest 
progressive lines of development from simple first approximations to more 
sophisticated systems. In the application of the genotypic approach to per- 
ceptrons, a number of laws of considerable generality have been discovered, 


as will be seen in subsequent chapters. 


2.4 Position of the Present Theory 


The groundwork of perceptron theory was laid in 1957, and 
subsequent studies by Rosenblatt, Joseph, and others have considered a 
large number of models with different properties (Refs. 7, 30, 31, 40, 
41, 76, 79, 80, 81, 82, 84, 85, 86). Perceptrons are genotypic models, 
with a memory mechanism which permits them to learn responses to 


stimuli in various types of experiments. In each case, the object of 
analysis is an experimental system which includes the perceptron, a defined 
environment, and a training procedure or agency. Results of such analyses 
can then be compared with results of comparable experiments on human or 
animal subjects to determine the functional correspondence and weaknesses 
of the model. A number of specific psychological tasks and criteria, which 
will be discussed in the following chapter, are used for the comparison of 


different systems. 


Perceptrons are not intended to serve as detailed copies of any 
actual nervous system. They are simplified networks, designed to permit 
the study of lawful relationships between the organization of a nerve net, the 
organization of its environment, and the "psychological" performances of which 
the network is capable. Perceptrons might actually correspond to parts of 
more extended networks in biological systems; in this case, the results 
obtained will be directly applicable. More likely, they represent extreme 
simplifications of the central nervous system, in which some properties are 
exaggerated, others suppressed. In this case, successive perturbations and 


refinements of the system may yield a closer approximation. 


The main strength of this approach is that it permits meaningful 
questions to be asked and answered about particular types of organization, 
hypothetical memory mechanisms, and neuron models. When exact 
analytic answers are unobtainable, experimental methods, either with 
digital simulation or hardware models, are employed. The model is not 
a terminal result, but a starting point for exploratory analysis of its 


behavior. 

3. PHYSIOLOGICAL AND PSYCHOLOGICAL CONSIDERATIONS 


In the last chapter, a methodological doctrine was proposed, 
which undertakes to evaluate classes of brainlike systems by comparing 
their performance with that of biological subjects in behavioral experi- 
ments; by gradually increasing the sophistication and varying the axio- 
matic constraints which define the experimental systems, it is hoped that 
models which closely resemble the biological prototype can ultimately be 
achieved. In this chapter, the desiderata for a satisfactory brain model 
are considered in more detail, from the standpoint of physiology and 
psychology. What are the parametric constraints, functional properties, 
and performance criteria which must be met, in order to achieve a model 


which is a plausible representation of the brain? 


The following discussion comes under three main headings: 
(1) established fundamentals; (2) current issues; and (3) the design of 
experimental tests of performance. It is not our purpose to review all of 
the relevant background in biology and psychology, but rather to highlight 
those points which bear most directly upon the present undertaking, and 
to suggest certain areas in which investigations might provide decisive 
evidence for or against some of the models which we shall propose. It 
will be noted that no attempt has been made to distinguish specifically 
"psychological" or specifically "physiological" problems in the following 
sections. Such distinctions are not only arbitrary in a number of the 
cases considered, but also tend to obscure the fact that we are interested 
in all of these problems because of their relevance to brain models, rather 
than to psychology or physiology per se. In this discussion, attention 
will be concentrated on the level of complexity which seems most commen- 
surate with that of the proposed models. Psychological material on psycho- 
neuroses, or on attitude formation, for example, while it might be brought 
to bear on the evaluation of some future models, is hardly likely to be 
relevant at this time. On the physiological side, we are chiefly concerned 
with the overall organization of the nervous system, its microstructure, 

and conditions for impulse transmissions; we are less concerned with 
details of neuroanatomy and neurochemistry, although such data may 
become important in more sophisticated models, where a closer correlation 


with the biological system is sought. 


3.1 Established Fundamentals 


3.1.1 Neuron Doctrine and Nerve Impulses 


It was only during the first decade of this century that a strong 
case was developed for regarding the neuron as the basic anatomical unit 
of the nervous system. The demonstration that this is the case rests largely 
upon the work of Ramon y Cajal (Ref. 14). Since Cajal's time, a great variety 
of neurons, differing in size, numbers of dendritic and axonal processes, and 
the distribution of these, have been described by neuroanatomists (Refs. 8, 
52, 93). Today it is generally accepted that in virtually all biological species, 
the nervous system consists of a network of neurons, each consisting of a 
cell body with one or more afferent (incoming) processes, or dendrites, and 


one or more efferent (outgoing) processes, or axons. The axons branch into 




small fibers which may make contact with, but remain separate from the 
surface membrane of cells or dendrites upon which they terminate. Neurons 
are generally divided into three classes: (1) sensory neurons, which generate 
signals in response to energy applied to sensory transducers, such as photo- 
receptors or pressure sensitive corpuscles; (2) motor neurons, (or effector 
neurons) which transmit signals to muscles or glands and directly control 
their activity; (3) internuncial neurons, (or associative neurons) which form 
a network connecting sensory and motor neurons to one another. The brain, 
or central nervous system, is made up almost entirely of neurons of this 


last type. 


The actual signals carried by these neurons may take one of 
several forms. Until recently, it was supposed that all information in the 
nervous system was represented by a code of all-or-nothing impulses, 
corresponding to on-off states of the neurons. A sufficient input signal was 
supposed to trigger the receiving cell directly into emitting a spike potential, 
which was transmitted without decrement from the receiving region of the 
dendrites to the cell body, and out along the axon to the terminal endbulbs, 
where it might or might not succeed in triggering later cells in the network. 
In a recent review (Ref. 11) Bullock has pointed out that this view has been 
largely supplanted by a far more complicated picture. While it is true that 
the transmission of signals over long distances is generally accomplished 
by means of all-or-nothing spike propagation along the axons of nerve cells, 
the spike impulse is not a direct response to impulses which arrive at the 


dendrites, and may originate at a point which is separated by a considerable 
distance from the site at which incoming impulses are received. Essentially, 
the currently accepted concept is that the dendritic structure and cell body 
jointly act as an integrating system, in which a series of incoming signals 
interact to establish a pre-firing state in a region at the base of the axon, 
from which impulses originate. If this pre-firing state reaches a threshold 
level (presumably measured by membrane depolarization) at a point within 
the critical region, a spike potential is initiated, and spreads without decre- 
ment along the axon. The interactions which may occur in the cell body and 
dendrites, however, involve potential fields in which the effects of impulses 
received at a given point spread over the surrounding membrane surface in 

a decrementing fashion. These effects may be graded in intensity, depending 
on frequency of impulses received, and the state of the receiving membrane 
at the time. Successions of impulses arriving at the same synapse can 
sometimes cause an increase in the sensitivity of the receiving membrane 
(facilitation) and can sometimes cause a progressive diminution in sensitivity 
(Ref. 11). There is evidence to suggest that different local patches of surface 
membrane are differently specialized, and respond in different ways to 
impulses received, even within the same neuron. Some of these regions 
appear to act as sources of internally generated signals, which may lead 

to spontaneous activity of the neuron, and the emission of spike impulses 


without any input signals from outside the cell. 

Two main types of synapses are recognized: excitatory and 
inhibitory. It is generally assumed, although it has not been proven, that 


a single neuron is either all excitatory or all inhibitory, in its effect upon 


post-synaptic cells. It remains possible, however, that the individual 
synaptic endings are specialized, some of them releasing a depolarizing 
transmitter substance (excitatory endings) while others release a hyper- 
polarizing substance (inhibitory endings). A single synapse, so far as 
is known, remains either excitatory or inhibitory, and is incapable of 


changing from one to the other. 


The nerve impulse itself is a basically non-linear response to 
stimulation. It is supported by energy-reserves of the axon by which it is 
transmitted, rather than by a propagation of energy from the sources of 
excitation. The nerve impulse is manifested by a moving zone of electrical 
depolarization of the surface membrane of the neuron, the exterior of which 
is normally 70 to 100 millivolts positive relative to the interior. This zone 
tends to spread along the axon due to ionic currents which tend to break 
down the potential difference between the interior and exterior of the 
neuron, until the membrane is repolarized by metabolic processes (see 
Eccles, Refs. 18, 19). The resulting "spike potential" takes the form of 
an electrically negative impulse (measured relative to the normal surface 
potential of the membrane) which propagates down the fiber with an average 
velocity of about 10 to 100 meters per second, depending on the diameter 


of the fibers (cf. Brink, Ref. 9). 


The arrival of a single (excitatory) impulse gives rise to a 
partial depolarization of the post-synaptic membrane surface, which 
spreads over an appreciable area, and decays exponentially with time. 
This is called a local excitatory state (l.e.s.). The l.e.s. due to 
successive impulses is (approximately) additive. Several impulses 


arriving in sufficiently close succession may thus combine to touch off 




an impulse in the receiving neuron if the local excitatory state at the base 
of the axon achieves the threshold level. This phenomenon is called 
temporal summation. Similarly, impulses which arrive at different points 
on the cell body or on the dendrites may combine by spatial summation to 
trigger an impulse if the l.e.s. induced at the base of the axon is strong 


enough. 
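
Temporal summation, as described above, can be sketched in a few lines of Python. Each excitatory impulse is assumed to contribute a fixed amplitude which decays exponentially, and the contributions are added. The amplitude, the decay constant `tau`, and the threshold are hypothetical values chosen only to make the effect visible; they are not physiological measurements.

```python
import math

def les_at(t, arrival_times, amplitude=1.0, tau=4.0):
    """Local excitatory state at time t (ms): sum of exponentially decaying
    contributions from every impulse that has already arrived."""
    return sum(
        amplitude * math.exp(-(t - t_arrival) / tau)
        for t_arrival in arrival_times
        if t_arrival <= t
    )

def fires(arrival_times, threshold=1.5, **kwargs):
    """True if the l.e.s. reaches threshold at some arrival time.
    With purely excitatory impulses the sum peaks at an arrival,
    so checking the arrival times is sufficient."""
    return any(les_at(t, arrival_times, **kwargs) >= threshold
               for t in arrival_times)

close_pair = [0.0, 1.0]    # two impulses 1 ms apart
spread_pair = [0.0, 20.0]  # two impulses 20 ms apart

print(fires(close_pair))   # contributions overlap and add
print(fires(spread_pair))  # first contribution has largely decayed
```

Spatial summation has the same additive form, with the sum taken over impulses arriving at different points on the cell rather than at different times.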


The passage of an impulse in a given cell is followed by an 
absolute refractory period during which the cell cannot be fired again, 
regardless of the level of input activity. This is equivalent to an infinite 
threshold during this period. The spike potential and absolute refractory 
period last about 1 millisecond. Finally, there is a relative refractory 
period which may last for many milliseconds after the initial impulse. 
During this time, the threshold gradually returns to normal, and may 
even fall to somewhat below its normal level for a time. While the 
response of a cell to a single momentary stimulus, such as an electrical 
pulse, is markedly non-linear (the amplitude of the generated impulse 
being quite independent of the amplitude of the triggering signal) the 
effect of a sustained excitatory signal, in many cases, is to evoke a 
volley of output spikes, the frequency of which may be roughly propor- 
tional to the intensity of the stimulus over a wide range. This is parti- 
cularly true of sensory neurons, where the frequency of firing may be 
used to determine the intensity of the stimulus energy with considerable 


accuracy. 
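
The relation between sustained input and firing frequency can be illustrated with a simplified threshold unit in Python. The sketch assumes a leaky integrator which is reset after each spike, an absolute refractory period of 1 ms during which no firing is possible, and a relative refractory period during which the threshold starts at twice its normal value and relaxes exponentially. All parameter values and names are assumptions for illustration, and the undershoot of the threshold below its normal level is not modeled.

```python
import math

def count_spikes(intensity, duration_ms=1000.0, dt=0.1,
                 tau=10.0, base_threshold=1.0,
                 absolute_ms=1.0, relative_tau=5.0):
    """Count output spikes evoked by a constant input of the given intensity."""
    potential = 0.0
    last_spike = -math.inf
    spikes = 0
    steps = int(duration_ms / dt)
    for step in range(steps):
        t = step * dt
        since = t - last_spike
        if since < absolute_ms:
            # Absolute refractory period: equivalent to an infinite threshold.
            potential = 0.0
            continue
        # Relative refractory period: raised threshold relaxing to normal.
        threshold = base_threshold * (
            1.0 + math.exp(-(since - absolute_ms) / relative_tau)
        )
        # Leaky integration of the sustained input (Euler step).
        potential += dt * (intensity - potential / tau)
        if potential >= threshold:
            spikes += 1
            last_spike = t
            potential = 0.0
    return spikes

for intensity in (0.2, 0.4, 0.8, 1.6):
    print(intensity, count_spikes(intensity))
```

In this sketch the spike count rises with the input intensity, while the absolute refractory period places an upper bound on the attainable frequency.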

The general picture of the nervous system, then, is one of a 
large set of signal generators, each having one or more outputs, on which 
nerve impulses may appear. These impulses may vary in frequency, and 
to some extent in amplitude, but seem to carry information mainly in a 
pulse-coded form. The signal generators themselves are decision elements 
of a most intricate type; each one makes its decision to initiate an output 
impulse according to a complicated function of the series of signals received 
at each of its synapses or receptor areas, as well as its own internal state. 
In a brain model, a neuron of this complexity would tend to make the system 
unintelligible and unmanageable with the analytic and mathematical tools 
at our disposal. Simplifications will therefore be introduced, as in the 
manner of the McCulloch-Pitts neuron; but it should be remembered that 
the biological neuron is considerably more complicated, and may incorporate 
within itself functions which we require whole networks of simplified neurons 


to realize. 
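
For comparison, a simplified neuron in the manner of McCulloch and Pitts reduces to a few lines of Python. The sketch assumes binary inputs, a fixed integer threshold, and absolute inhibition (any active inhibitory input prevents firing); the function name and the index-list representation of connections are illustrative choices.

```python
def mcculloch_pitts(inputs, excitatory, inhibitory, threshold):
    """Return 1 if the unit fires, else 0.

    inputs: sequence of 0/1 signals.
    excitatory, inhibitory: lists of indices into `inputs`.
    Any active inhibitory input vetoes firing (absolute inhibition)."""
    if any(inputs[i] for i in inhibitory):
        return 0
    return 1 if sum(inputs[i] for i in excitatory) >= threshold else 0

pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Three logical functions realized by changing only threshold and wiring.
and_out = [mcculloch_pitts(p, [0, 1], [], 2) for p in pairs]
or_out = [mcculloch_pitts(p, [0, 1], [], 1) for p in pairs]
and_not_out = [mcculloch_pitts(p, [0], [1], 1) for p in pairs]

print(and_out, or_out, and_not_out)
```

Everything the biological neuron does by way of graded potentials, facilitation, and spontaneous activity is absent from this unit, which is the price paid for analytic tractability.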


3.1.2 Topological Organization of the Network 


The human brain consists of some 10^10 neurons of all types. 
These are arranged in a network which receives inputs from receptor 
neurons at one end, and conveys signals to the effector neurons at the 
output end. Different sensory modalities -- vision, hearing, touch, etc. -- 
communicate with the central nervous system by way of distinct nerve 
bundles, which enter it at different points. Each of these modalities, 
after passing its information through a network of cells which respond 


more or less exclusively to stimuli from that modality, eventually contributes 
to a common pool of activity in the "association areas" of the central 
nervous system (CNS). Output signals originate either from the parts of 
the CNS which are specific to a particular modality (for example, the 
pupillary reflex mechanism) or from the common activity areas (as in 
speech). Final outputs may go through a series of stages in which motor 
patterns or sequences are selected, and detailed coordination is regulated. 
From these motor control regions, feedback paths re-enter the association 
areas and sensory integration areas, so that the possibility of an elaborate 


servo-mechanism for the control of motor activity exists. 


While this general picture holds true for most biological 
organisms, there is considerable variation both in gross and detailed 
anatomy, from species to species and individual to individual. In under- 
taking to design a first order approximation to this structure for use in a 
brain model, we will begin with a network consisting of a single array of 
sensory units, a layer of association units, and a single effector, or 
response unit. In later models, more complicated structures will be 
considered. Even the simplest models, however, are capable of showing 
a surprising similitude to the functional properties of the brain. It seems 
reasonable, therefore, to regard the complications of neuroanatomy in the 
various species as elaborations of a basically simple schema, which is to 
be found throughout. This basic plan of organization is illustrated in
Figure 1.
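

The first-order network described above (a single array of sensory units, a layer of association units, and a single response unit) can be sketched in a few lines of code. This is only an illustrative sketch, not a definition taken from the text: the function names, the random wiring of sensory units to association units, the fixed +1/-1 connection signs, and the threshold values are all assumptions chosen for brevity.

```python
import random


def build_network(n_sensory, n_assoc, fan_in, seed=0):
    """Wire each association unit to `fan_in` distinct, randomly chosen sensory units.

    Assumption: each connection has a fixed sign, +1 (excitatory) or -1 (inhibitory).
    """
    rng = random.Random(seed)
    return [
        [(s, rng.choice((1, -1))) for s in rng.sample(range(n_sensory), fan_in)]
        for _ in range(n_assoc)
    ]


def respond(stimulus, connections, weights, a_threshold=1, r_threshold=0.0):
    """Propagate a binary stimulus through S -> A -> R.

    Returns the list of association-unit activities (0 or 1) and the
    binary output of the single response unit.
    """
    a_active = [
        1 if sum(sign * stimulus[s] for s, sign in conns) >= a_threshold else 0
        for conns in connections
    ]
    total = sum(w * a for w, a in zip(weights, a_active))
    return a_active, (1 if total > r_threshold else 0)
```

A call such as `respond(stimulus, build_network(16, 8, 4), [1.0] * 8)` takes a 16-element binary stimulus and returns the activity of the eight hypothetical association units together with the response. The weights are supplied by the caller because nothing has yet been said here about how they are acquired.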


The distribution of cell types and connection patterns has been
studied by Lorente de Nó, Sholl, Bok, and others (Refs. 8, 52, 93). A
typical cell in the cerebral cortex receives input connections from some 
hundreds of other cells, which may be located in widely scattered regions, 
but its output is more likely to be transmitted to a relatively localized 
region. Cells which receive sensory input signals are likely to have a 
restricted field of origins in a sensory surface, such as the retina or
the skin.


The mapping of the frog retina into the brain has been studied 
by Lettvin (Ref. 51) who finds a rather precise topographic mapping, in 
which several different types of information are represented in different 
layers.* This topographic mapping is established genetically despite
the fact that the fibers which transmit the information from the retina 
are apparently completely "scrambled" in the optic nerve. Moreover, 
experiments by Sperry (Ref. 94) and more recently by Lettvin (Ref. 51) 
show that if the optic nerve is severed and allowed to grow together again, 
the fibers which originally transmitted to a particular terminal location will 
tend to reconnect to that same terminal location, with surprisingly little 
loss of precision. This points to a highly specific neural organizing 
capability, which must be taken into account in considering admissible 
types of constraints for a brain model. In the mammalian brain, each 
sensory modality appears to be represented by an orderly topographic 
mapping analogous to that just described. Auditory stimuli, for example, 
are mapped into a region which is organized according to pitch; tactile
stimuli are mapped according to body location, and so forth. Similarly,
the motor neurons are organized, in the cerebral cortex, in an ordered
arrangement which is topologically similar to the organization of the
muscles which are controlled.


* See also Section 3.1.4.


In contrast to the highly specific regional organization in the 
gross anatomy of the sensory projection areas of the cortex, the detailed 
microstructure of the network appears to be essentially random, governed 
only by directional gradients and preferences, and statistical distributions 
of fiber lengths for various types of cells (see Sholl, Ref. 93). In the 
human nervous system, it appears that the most specific and constrained 
topological organizations are to be found in the sensory and motor systems, 
while the intervening association network of the CNS is less tightly 
controlled in its organization, presumably depending more on learning 
and adaptive modification to establish the required pathways and linkages. 
The degree of precision in establishing the topological organization of 
neurons in even the most highly constrained reflex mechanisms is probably 
far less than that in most artificial data processing devices, and must retain 
a certain degree of randomness wherever the number and density of 
connections is appreciable. Unfortunately, no data are available which 
would indicate the complexity of topological constraints which correspond 
to the highly complex inherited behavior patterns which are known to 
exist in many species. Since the nature of such constraints is unknown, 
we shall avoid gratuitous assumptions about them, as far as possible.
In the development of brain models, it will be our general strategy to start
out with minimally constrained networks, and examine the consequences of
introducing particular types of constraints, one at a time.


3.1.3 Localization of Function


Ever since the brain was first credited with the control of 
psychological activity, attempts have been made to delineate separate 
functions for its different parts. In the last century (largely under the 
influence of Gall) this took the form of an assignment of "mental faculties"
such as intelligence, combativeness, amativeness, and religiosity, to 
special regions of the brain. As techniques for the study of functional 
anatomy improved, this gave way to a concept of organization into sensory 
tracts, motor tracts, and association tracts. The functional organization 
which was revealed has been most firmly established in the case of sensory 
and motor tracts, where a particular position in the brain is correlated with 
a particular sensory locus, or a particular set of muscles whose activity it 
controls. An excellent review of sensory and motor mapping can be found 
in Ruch (Refs. 88, 89). More recently, a finer breakdown in the localization 
of sensory functions has been demonstrated by Lettvin and associates (Ref. 51). 
Four distinct types of information, involving distinct aspects of the visual 
stimulus (contrast, curvature, movement, and dimming of illumination) have 
been shown to be mapped into four distinct layers of the tectum of the frog.
This suggests localization of analytic functions, of a sort which has been
suspected but not previously demonstrated.


In dealing with the so-called "association areas" of the cerebral 
cortex, and with other parts of the brain which are not clearly related to 
sensory data processing or motor coordination, something of the old 
treatment in terms of "mental faculties" still remains; specifically,
centers have been found which are commonly credited with primary
responsibility for temporary and permanent memory, for emotional behavior,
for speech recognition and speech production, and (in the frontal lobes) for 
the integration of complex goal-directed activities. The lack of clear operational
tests for such capabilities has been a hindrance to progress in such
functional mapping, and the results are considerably more ambiguous than
is the case with sensory and motor functions. A discussion of current
evidence on brain localization with respect to these "higher faculties" is
found in Pribram (Ref. 72). Much of the recent work is concerned with the
localization of tracts which influence motivation, alertness, and consciousness
in the organism (Refs. 1, 22, 38, 64, 65).


One feature which is of particular importance for brain models 
is the apparent plasticity of localization in the "association areas" (or
"intrinsic systems", to use the terminology advocated by Pribram) in
contrast to the relatively fixed and irreplaceable character of the sensory 
and motor tracts. Loss of function, due to destruction of association cortex, 
is apt to be transient, with adjacent areas taking over the function after a 
period of readaptation. Jackson, in his classic studies of the motor cortex, 
(Ref. 36) observed that even here localization is not rigid and absolute, and 
that a certain amount of flexibility exists, permitting the functions of damaged 
tissue to be taken over by neighboring areas. The sensory projection areas, 
on the other hand, appear to be indispensable to perception; destruction of
the optical cortex leads to permanent blindness in an area corresponding to 
the location of the lesion, and similar phenomena are to be found in other 
sensory modalities. Thus, the extreme hypothesis of equipotentiality
advocated originally by Lashley (Ref. 49), who observed that cortical
ablation appeared to produce a general deficit in performance proportional
to the amount of cortex extirpated, rather than eliminating specific memories
and abilities, has been modified in the direction of relative localization,
which is quite strict for certain sensory functions, and comparatively weak
and readily modified for more complicated control functions, thinking, and
memory.


A rather different approach to localization is suggested by the
histological studies of cortical tissue, initiated originally by Brodmann, and
pursued more recently by Lorente de Nó and Sholl (Refs. 52, 93). The
"cytoarchitectonic areas" which have been described in these studies differ
in their microstructure and detailed organization, and attempts have been made
to relate such differences to the function of the cortex in which they occur.
To date, this approach has not led to particularly significant results, although
in principle it may ultimately suggest the essential organizational properties
which must be incorporated into a brain model.


At the primitive level of organization to which our models will 
aspire at this time, current data on brain localization are of only secondary 
interest. The main features of the brain still seem to be adequately 
described by the general topological structure shown in Fig. 1. The 
"central integration and control network" indicated in the diagram is known 
to possess some important internal demarcations in higher organisms, but
the precise functions of these parts and their interrelations are still largely
speculative. In simpler brains (crustacea, for example) the gross
organization is probably no more complex than indicated by the diagram;
and it seems likely that in general it is the fine structure, rather than the
gross anatomy, which determines the functional properties of the network.


3.1.4 Innate Computational Functions


There is no doubt that mechanisms of considerable complexity, 
sufficient for perceptual tasks and the control of organized behavior, can 
be created by genetic control of growth and maturation. This is most 
dramatically evident in the instinctual patterns of insects (for example, 
the well known communication system of bees, and the frequently cited 
behavior patterns of carpenter wasps), but is also clearly present in 
vertebrates (e.g., the spawning behavior of salmon, and the migratory 
behavior of birds, as described in Ref. 90). Recently, Gibson and Walk 
have furnished clear experimental evidence for the innate perception of 
depth in mammals (Ref. 24). All of these phenomena require "built-in" 
control mechanisms, of a rather intricate sort. In the cases just cited, 
these built-in mechanisms are not known in any detail. A number of more 
elementary functions have been discovered, however, which provide some 
picture of the types of "computational mechanisms" which are likely to
exist throughout the central nervous system.


The stimulus analyzing mechanisms discovered by Lettvin and 
associates for frog vision have already been mentioned. In these studies, it 
is found that certain ganglion cells in the frog retina respond only to contours 
or strong contrast gradients within their sensory field; others respond only to 
convex images; others to moving boundaries; and still others to a general 
dimming of illumination over their entire field. Each of these four cell types 
transmits its information to a distinct layer of the frog's tectum, where its
position is mapped topographically. Thus, one layer represents a contour
map, or outline drawing of the stimulus field, another represents a location
map for small convex objects or corners, a third represents movement
vectors, and a fourth indicates regions of dimming illumination.


* Other visual analyzing mechanisms have recently been demonstrated by
Hubel and Wiesel (Ref. 113) in the cat's cortex (see Chapter 23).


At the motor-control end of the nervous system, a number of 
reflex arcs and servo-control systems have been analyzed. The pupillary 
reflex, for example, has been analyzed as a typical servomechanism by
Stark and Baker (Ref. 96). A considerable amount of work has also been 
done on the cerebellar servomechanisms which regulate muscular action 
under the control of cortical decisions and kinesthetic feedback information 
(cf. Ruch, Ref. 89). It is probably safe to assume that similar closed-loop
control systems, employing familiar servomechanism principles, are
employed throughout the central nervous system for such purposes as
controlling level of activity, preventing runaway excitation phenomena
(such as occur in epileptic seizures), and regulating sensitivity to selected
aspects of the sensory input data.
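

The closed-loop principle referred to above can be made concrete with a deliberately crude sketch of a pupil-like regulator. It is not the model of Stark and Baker, whose analysis is not reproduced here. The proportional control law, the function name, and every parameter value are assumptions made purely for illustration.

```python
def simulate_pupil(light_levels, target_flux=1.0, gain=0.25, area=1.0,
                   min_area=0.1, max_area=4.0):
    """Return the light flux reaching the 'retina' at each time step.

    Assumed control law: the aperture area is reduced in proportion to the
    excess of flux over the target, and enlarged when flux falls short.
    The loop is stable only while light * gain stays below 2.
    """
    flux_history = []
    for light in light_levels:
        flux = light * area              # light admitted through the aperture
        flux_history.append(flux)
        error = flux - target_flux       # positive when too much light enters
        area -= gain * error             # negative feedback opposes the error
        area = max(min_area, min(max_area, area))  # physical limits of the aperture
    return flux_history
```

Under a steady illumination of 2.0, the flux starts at 2.0 and the error is halved on every step, so the flux settles toward the target without any explicit computation of the required aperture. This is the sense in which such a loop is "analog" rather than "digital" in the discussion that follows.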


It is worth noting that most of the specific computing mechanisms 
used in muscular control appear to be of an analog variety, rather than digital; 
they make use of intensities and frequencies of activity for the direct control 
of servo-systems, rather than computing a control formula from encoded 
data and then generating the control signal required. The stimulus analyzing 
mechanisms found by Lettvin, however, constitute a sort of digital code, in 
which stimulus properties are represented by presence or absence of signals 
from particular neurons. It seems likely, as von Neumann has observed
(Ref. 105), that the brain makes extensive use of both digital and analog
principles in its operation, and it appears that both types of devices may
be genetically determined.


An interesting example of theoretical speculations on possible
computational functions employed in shape discrimination in the octopus can 
be found in Sutherland (Ref. 98). Sutherland reviews several alternative 
theories, and presents evidence in support of his own conjecture that the
octopus responds to an analysis of the horizontal and vertical dimensions
of the stimulus measured along all possible cross-sections. No attempt is
made, however, to tie the computational process to a particular neurological
structure, or to indicate a mechanism which might carry out the indicated
operations.


3.1.5 Phenomena of Learning and Forgetting


Thus far, we have concentrated on the anatomical and physio- 
logical features of the nervous system which appear to be basic for the 
design of a brain model. We now turn to some of the behavioristic and
psychological functions which a brain model should be able to demonstrate.


Phenomena of retention and adaptation in organisms have been 
studied in a variety of experiments, varying greatly in their design. In 
traditional usage, "memory" experiments have been concerned more with 
the retention and recall of experience, while "learning" experiments are 
concerned with the acquisition and modification of behavior. Both types of 
investigation, however, are concerned with lasting modifications in the state 
of the organism, and in complicated problems (e.g., those involving
"insight") one tends to merge into the other; accordingly, all of these
experiments will be considered together in this discussion.


Quantitative studies of learning and memory in psychology
stem from the classical experiments of Ebbinghaus, in 1885, on the learning 
and retention of nonsense syllables. Using himself as a subject, he obtained 
learning and forgetting curves, and demonstrated many of the phenomena of 
recognition and retention which have interested psychologists ever since. 
Related phenomena have been studied by Bartlett (Ref. 5) using more highly
organized material. A second type of experiment, the conditioned reflex 
experiment, first employed by Pavlov, is characterized by the association 
of an existing response to a new stimulus, which did not evoke the response 
prior to the conditioning procedure. A third type of experiment, employed 
originally by Thorndike and recently studied extensively by Skinner and 
others, is concerned with the learning of a pattern of behavior which is 
instrumental to the solution of a problem, or which satisfies a drive. 
Where such problem-solving behavior appears to depend in a crucial way 
upon a "cognitive restructuring" of the situation, or the formation of a new
"concept", we have an experiment in "insight" or "concept formation", as
in the studies of the Gestalt psychologists.


It is possible that these three types of experiments are actually 
demonstrating fundamentally different mechanisms of learning. The first 
deals with recognition and recall of previous perceptual experience; the 
second is concerned with the generalization of responses from initial 
stimuli to new stimuli by virtue of temporal association; the third is 
concerned with the discovery and establishment of problem-solving behavior. 
Still other experiments deal with such phenomena as short-term memory
span, acquisition of needs and motives, attitude formation, perfection of a
motor skill, or learning to make fine perceptual judgements. Undoubtedly,
the same physiological processes are tapped in many of these tasks; on the
other hand, attempting to subsume all of them under a set of general "laws
of learning" does not seem to be particularly helpful for our present purpose.
From the standpoint of brain model construction, it seems safest to regard 
each type of learning experiment as a distinct problem, with its own variables 
and rules of behavior which we hope that our model will duplicate under 
equivalent experimental conditions. The main value of such psychological 
experimentation, then, is to provide us with a set of "calibration experiments", 
by means of which a model can be compared with known organisms under well 
defined conditions. The reader who is unfamiliar with the literature of 
learning experimentation will find the reviews by Hilgard, Brogden, and 
Hovland (in Ref. 112) particularly helpful.


In a number of experiments, attempts have been made to find 
the actual physiological correlates of the learning or memory phenomenon. 
Notable among these are the experiments of Penfield (Ref. 68), who finds 
that electrical stimulation of selected points on the cortex may evoke long 
and vivid sequences of past experience, apparently with hallucinatory clarity. 
John (Ref. 39) has recently reviewed experiments in cortical conditioning, and 
reported a number of interesting results of his own, which suggest that 
memory may involve modification of the connections between the deep centers 
of the brain stem and the cerebral cortex, with the reticular formation playing 
a particularly significant role. The experiments of Olds (Refs. 64, 65, 66) 
on the reinforcing effects of electrical stimulation applied to certain points 
in the hypothalamus and adjacent structures suggest that these may be 
involved in the motivational aspect of learning. Such experiments, which 
have only recently become possible through the improvement of electrophysiological
techniques, are likely to become increasingly valuable as
guides to theory construction.


3.1.6 Field Phenomena in Perception


Early studies of perception were largely concerned with the 
absolute question of what perceptions are made of; such studies were 
concerned with range and sensitivity of sensory abilities, measurement of 
limits and thresholds, and the detailed dissection of sensory stimuli into 
fundamental components. Such studies form the main subject matter of 
classical psychophysics. In psychology, they gave rise to an atomistic
approach (reaching its ultimate expression in the work of Titchener) in
which it was proposed that any phenomenon of perception could be accounted 
for by a proper compounding of sensory elements, each of which retains its 
own identity, like a piece of tile in a mosaic. During the last few decades, 
largely under the influence of the Gestalt psychologists, studies of perception 
have turned from the question of the constituents of perception to the question 
of the conditions under which a given perception occurs. It is now generally 
accepted that what is perceived depends not only upon the properties of the 
stimulus object, or image, which is recognized, but upon the organization 
of the entire sensory field in which it is embedded. This is true not only
in vision, but in other sensory modalities as well.


The field phenomena which have been studied include the effects 
of contrast, figure-ground organization, frames of reference, depth perception, 
size constancy, and illusions. The reader is referred to Koffka (Ref. 44)
and Gibson (Ref. 26) for detailed discussion of these topics. For present
purposes, the most important implication of this work is that a physical 
model for a perceiving system must permit the interaction of all elements 
in a spatially organized field. It is not sufficient simply to detect sets of
elements which represent a "pattern"; the perception of a pattern, and the
interpretation of it, depend in a fundamental way on metric relationships
to other sense data from the same modality, and correlations with sensory
data from entirely different modalities. The perception of a line as "upright",
for example, depends on its observed angles relative to visual standards of
"uprightness", such as the corners of a room, and also upon the gravity
senses and kinesthetic data which provide a frame of reference for "up"
and "down". The decision that two disjoint patches of illumination represent
parts of the same object rather than different objects depends upon their 
contrast or resemblance to the field structure around them, as well as on 
their relationship to one another. It is possible (as Gibson has suggested)
that recognition is never achieved, in biological systems, by the representation
of a particular receptor configuration, but only by the representation of sets
of relations (angles, ratios, etc.) as its elementary data. If this is the
case, a suitable set of analyzing mechanisms, capable of measuring such
variables, must be included in the pre-recognition tracts of a brain model.
As our models gain in sophistication, it is, in fact, becoming increasingly
apparent that such analyzing mechanisms are essential for purposes of
efficiency and economy of design.


The perceptrons to be considered initially will not possess 
intrinsic field-organization properties. With the introduction of cross- 
coupled systems, such properties begin to emerge. An evaluation of 
these systems by means of typical "Gestalt perception experiments" has
barely begun at the present time, but represents one of the most important
tasks to be undertaken.


3.1.7 Choice-Mechanisms in Perception and Behavior


Selective attention and "set" are fundamental phenomena in
the control of psychological activity. They indicate mechanisms for 
choosing between alternative courses of action, or points of view, and 
play a logical role analogous to the selection of different branches in a 
"flow diagram" of a digital computing routine. Attention and psychological 
set are largely determined by the situational context in which behavior 
occurs, and by the current "goals" or "purposes" of the organism, which
may be thought of as choices of a superordinate sort, under which
sub-decisions are made to select particular modes of activity. For example,
an individual who is set to look for a word in a dictionary will be most
attentive to the sequence of letters in boldfaced type, while someone who
is looking for torn pages will probably be unaware of the particular letter
combinations, and someone who is simply scanning the volume to look for
pictures is apt to notice neither the spelling nor the condition of the
pages.


The importance of set, or attitude, for learning has been 
emphasized by Hebb (Ref. 33), but choice mechanisms of this type have 
rarely been incorporated in the detailed design of theoretical brain 
models. In purely logical models of behavior, they play a considerably 
more prominent role -- for example, in Tolman's learning theory, and 
in Newell and Simon's models for problem solving behavior (Refs. 62, 63),
selective choice-mechanisms are specifically designated. In a brain
model, it is clear that such phenomena must be closely related to the
problem of "temporary memory", since the set under which the brain
is currently operating must be represented by a temporarily stable, but
nonetheless readily altered, state of the system, capable of modifying
processes which go on while it persists. It seems likely (although unsupported
by any direct evidence) that pools of neurons connected by
reverberating circuits may be important set-maintaining devices in the
nervous system, exerting their influence on the brain as a whole by
means of a widely distributed barrage of sub-threshold excitation or
inhibition. The plausibility of such mechanisms will be considered in
more detail in a later chapter.
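

The way in which a sub-threshold bias could act as a "set" may be illustrated by a toy example based on the dictionary case given earlier in this section. The sketch below is purely hypothetical: the unit names, the numerical excitation and bias values, and the simple additive threshold rule are assumptions, and no claim is made that the nervous system operates in this way.

```python
def responses(inputs, set_bias, threshold=1.0):
    """Return the names of the units that fire under a given set.

    inputs:   dict mapping unit name -> excitation from the stimulus
    set_bias: dict mapping unit name -> sub-threshold bias supplied by the
              (hypothetical) set-maintaining pool; units not listed get 0
    """
    return sorted(
        unit
        for unit, excitation in inputs.items()
        if excitation + set_bias.get(unit, 0.0) >= threshold
    )


# The same page of the dictionary excites both kinds of unit equally,
# but by itself the stimulus is too weak to fire either of them.
page = {"letters": 0.7, "page_condition": 0.7}

looking_for_word = {"letters": 0.4}           # set: find a word
looking_for_tears = {"page_condition": 0.4}   # set: find torn pages
```

With these values, `responses(page, looking_for_word)` selects only the letter units, `responses(page, looking_for_tears)` selects only the page-condition units, and an empty set selects neither. The bias never fires a unit by itself; it only determines which stimulus-driven units cross the threshold.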


3.1.8 Complex Behavioral Sequences 


The discussion of psychological sets and choice mechanisms 
brings us to a consideration of even more highly organized behavior and 
thought patterns, such as the steps taken in performing an arithmetic 
computation, or driving to work, or performing a piece of research.
All of these activities represent orderly sequences of decisions and action,
and can be considered, as Newell and Simon have suggested, as programs 
to be performed. In some cases, these programs are highly stereotyped, 
and determined by rigid rules; in other cases, they employ chance 
mechanisms and heuristic procedures. Much of the classical psychological 
literature on problem solving and insight is relevant to this second class
of programs, while a rat running a maze might be considered an example
of the first type. As in the case of selective attention and set, these
problems have not been dealt with in detail by any brain models proposed
to date, but it seems likely that at this level the brain and the computer
begin to approach a common meeting ground. Problems of memory span,
storage, and sequence control are present in both types of systems, and
many of the logical problems confronted in "heuristic programming" 
(Refs. 60, 62, 63) seem to be direct translations from human problem-solving
experience to the language of computing machines. This does
not mean that the physical structure of a brain model must ultimately
resemble that of digital devices, but rather that the same basic logical
organization -- a memory for programs, a memory for data, and a
mechanism for the sequential performance of a given program -- must be
available. The "programs" themselves presumably take the form of
sequences of selective sets, or bias states, arranged in a hierarchical
manner, so that sub-operations are performed under the control of a
"master set" or "master program" which determines the overall plan of
activity. While the detailed properties of such systems must necessarily
remain speculative at the present time, we shall see that such a concept
is compatible with the organization of perceptrons not too far removed
in complexity from those which we are now capable of analyzing.


3.2 Current Issues


While the discussion of the preceding section has attempted 
to stick to a relatively conservative and uncontroversial rendition of 
physiology and psychology as it applies to the brain model problem, it 
is clear that in the last pages we have been drawn into increasingly 
speculative and uncertain areas of discourse. In this section, an 
attempt will be made to highlight a number of issues which seem most 
salient in determining the fate of various brain models, and which are
not answerable at the present time outside the realm of speculation.
Of necessity, a physical model will have to take a stand on most of these
issues, and it is possible that by investigating the logical consequences of 
such a stand, a decision as to the plausibility of various alternatives might 
be made; the brain model approach has a chance, here, of providing answers 
which empirical studies have so far been unable to discover. In any event, 
the decisions taken on these issues represent the points at which a brain
model is most vulnerable to future attack, as new evidence is uncovered.


3.2.1 Elementary Memory Mechanisms


The status of current information on basic memory mechanisms
in the nervous system has been reviewed recently by Burns (Ref. 13). Most 
brain models employ some memory hypothesis, but evidence as to the nature 
of actual physiological mechanisms which might be involved is almost 
totally lacking. It is generally agreed, simply on the basis of definition, 
that whatever we call "memory" involves a modification of neural activity 
in the central nervous system or its output signals, as a function of 
exposure to previous events or "experience". In some models, this 
modification has been attributed to persistent activity in closed loops of 
neurons, but most theorists are now agreed that, while such a memory 
mechanism might account for "short term memory", and might play a 
significant role in the establishment of more permanent memory traces, 
there must also exist a non-volatile memory mechanism (e.g., a
structural or chemical change) which can outlast periods of neural inactivity,
and is relatively insensitive to transient activity in the nervous
system (see Hebb, Ref. 33, pp. 12-16). The nature of this memory trace
mechanism, it is generally agreed, must be such as to facilitate the use
or selection of neural pathways which have been active at the time of the
"remembered" experience or behavior, and virtually all specific models 
assume that it takes the form of a facilitation of connections between sources 
of excitation and responding neurons in the motor system or CNS. In 
making such an assumption, the influence of the conditioned reflex model, 
which suggests that sensory neurons become coupled to association neurons, 
by which they are connected to motor neurons, is clearly evident. An 
alternative position, in which the preferred pathways "win out" by surviving
deteriorative changes in unused pathways, rather than by active facilitation,
has not been explored to any significant degree, but appears to be logically
similar in its potentialities.
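

The logical similarity of the two positions can be seen in a minimal sketch. It assumes, purely for illustration, that each pathway carries a single numerical "value", that changes are additive and of fixed size, and that the preferred pathway is simply the one with the largest value. None of these assumptions is drawn from the physiological literature, and the function names are hypothetical.

```python
def facilitate(values, active, step=0.1):
    """Active facilitation: strengthen the pathways that were in use."""
    return [v + step if i in active else v for i, v in enumerate(values)]


def decay_unused(values, active, step=0.1):
    """Alternative position: weaken every pathway that was NOT in use."""
    return [v if i in active else v - step for i, v in enumerate(values)]


def preferred(values):
    """Index of the strongest pathway."""
    return max(range(len(values)), key=values.__getitem__)
```

Applied to the same history of activity, the two rules give values which differ only by a constant offset (one step per trial) on every pathway, so the ordering of the pathways, and hence the preferred pathway, is the same under either rule. The equivalence is exact only under the additive assumption made here; with bounded or multiplicative changes the two rules could diverge.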


Granting that the memory mechanism takes the form of some 
means of selecting particular patterns of activity in preference to others, 
depending upon the input or current state of the nervous system, particular 
physiological models include: (1) mechanisms for reconstituting past activity 
states of the entire CNS or a major portion of it; (2) mechanisms for selecting 
particular output channels as a function of current activity or sensory inputs. 
The specific mechanisms proposed generally fall into one of the following
four categories:


(1) Extracellular influences and modification of the neural medium: 
This has been proposed by Köhler (Ref. 45), Bok (Ref. 8), and others, who
assume that, if a "structural trace" is present at all, it is not laid down in
specific neurons, but in the surrounding medium, where it is capable of
modifying activity in nearby neural tracts. The possible form that such
a mechanism might take has never been specified in detail, and the approach
is generally discounted by current theorists. The motivation for such a
hypothesis comes in part from attempts at preserving the isomorphism
between a spatially distributed memory trace and spatially organized
visual events, as in Köhler's system. While it is not implausible to assume
that the surrounding medium participates in the memory trace structure,
it seems likely that such interaction between medium and neurons would
be highly localized, probably influencing only a single neuron or synaptic
junction, rather than forming a widespread organized structure independent
of the neurons themselves. If such a position is accepted, then whatever is
left of this approach can be subsumed under one or another of the remaining
neural modification mechanisms.


(2) Threshold Modification: The hypothesis that the threshold 
of an active neuron may be reduced as a consequence of the activity, thus 
making it more likely that this cell will respond to future stimuli, has 
frequently been proposed as a possible memory mechanism (cf. Taylor,
Ref. 99). If we take the "threshold", in its conventional sense, to mean
the degree of membrane depolarization or the level of input excitation
which will cause the neuron to discharge, regardless of the particular
synapses involved in the transmission of excitation, then this model
meets two main objections: first, the sensitivity which is acquired is non-
specific, making it more likely that the cell will respond to any input, rather
than just those which were effective at the time that the memory trace was
established; second, after a long history of activity, we would expect the
thresholds of all neurons to be reduced to a minimum level, unless some
recovery mechanism exists. If such a recovery mechanism does exist,
memory will tend to be lost as a consequence, and it must be shown that
the rate of forgetting would not vitiate the value of the system. Occasionally,
the concept of "threshold reduction" seems to be used in the sense of an
increase in specific sensitivity of a neuron to a particular afferent fiber.
In this case, the threshold reduction mechanism becomes indistinguishable
from a synaptic facilitation mechanism, which is considered below.


(3) Strengthening of active neurons: Eccles (Ref. 18), Uttley
(Ref. 102), and Rosenblatt (Ref. 79) have proposed models in which the
output signals of a frequently active neuron gain in strength or effectiveness,
affecting all terminals alike. This model retains the specificity of response
of a neuron (unlike the threshold reduction model) but increases its power
to activate the neurons which follow it in series. If the output signal from
a neuron goes to a single destination only, this is equivalent to a model which
strengthens particular synaptic connections. If the output goes to a number
of different locations, however, there is a lack of specificity in the channel-
selection properties of this mechanism, which must generally be offset by
auxiliary hypotheses. In Rosenblatt (Ref. 79) it is shown that by means of a
suitably organized feedback mechanism, a particular output channel can be
selected through a statistical bias. The feedback guarantees that those cells
which are reinforced all have at least one "desirable" output connection, the
other connections being distributed at random among a large number of
alternative terminal neurons, each of which consequently receives only a
fraction of the total reinforcement applied. While such a model is shown
to be logically workable, the specific feedback connections required make
it physiologically implausible, and it remains less efficient than a model
in which specific synapses, rather than total neurons, are selected for
modification.


(4) Modification of selected synapses: This model has been
employed by Culbertson (Ref. 17), Hebb (Ref. 33), and others, and is
employed in most current perceptron models. The mechanism takes account
of the correlation of activity between an afferent synapse and the efferent
neuron, augmenting the strength of the synaptic ending (or, equivalently,
the sensitivity of the sub-synaptic membrane) if the correlation is positive,
and, in some cases, diminishing it if the correlation is negative. The
actual physiological process by which such a correlation might occur is
obscure, but the logical advantages of such a mechanism are clear. Hebb
has proposed that actual synaptic growth might occur, improving the contact
between the transmitting and receiving neuron. While Eccles has considered
possible synaptic growth mechanisms in some detail (Ref. 18), there is little
evidence to support this conjecture. A possible biochemical mechanism has
been proposed by this writer (Ref. 83), which assumes that large molecules
used as catalysts for the production of transmitter substances in the endbulb
must originate from the nucleoplasm of the post-synaptic cell, and that the
exchange of these molecules is facilitated by membrane depolarization and
periods of activity in both cells. An alternative possibility, in which the mem-
ory mechanism is entirely contained within the post-synaptic cell, is
that a persistent sensitization of the subsynaptic membrane in the neigh-
borhood of an active synapse occurs, given the hypermetabolic state which
follows activity. The facilitation of a neuron's response to repeated sub-
threshold signals which has been reported by Bullock (Ref. 11) indicates
that a localized persistent effect of the sort hypothecated does exist; it
remains to be shown that the subsequent firing of the neuron may serve
to "stamp in", or fix in a more permanent manner, the temporary sensi-
tivity which has been observed.
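

The logic of mechanism (4) can be illustrated with a short sketch. It is
only an illustration under stated assumptions, not the rule of any particular
model cited above: activity is coded as 0 or 1, and the function name
`modify_synapses`, the step size `eta`, and the convention that a synapse is
weakened only when it was active while the efferent neuron stayed silent are
all hypothetical choices made here.

```python
def modify_synapses(strengths, afferent_active, efferent_fired, eta=0.25,
                    allow_decrement=True):
    """Return updated synaptic strengths after one observation.

    strengths        -- current strength of each afferent synapse
    afferent_active  -- 0/1 activity of each afferent synapse
    efferent_fired   -- True if the efferent neuron discharged
    eta              -- hypothetical step size (an assumption of this sketch)
    allow_decrement  -- whether negatively correlated synapses are weakened
    """
    updated = []
    for strength, active in zip(strengths, afferent_active):
        if active and efferent_fired:
            strength += eta          # positive correlation: augment
        elif active and not efferent_fired and allow_decrement:
            strength -= eta          # negative correlation: diminish
        # Inactive synapses are left alone, which keeps the change specific.
        updated.append(strength)
    return updated


# Three synapses of equal strength; only the first and third were active.
print(modify_synapses([1.0, 1.0, 1.0], [1, 0, 1], efferent_fired=True))
# prints [1.25, 1.0, 1.25]
```

Only the synapses which took part in the activity are changed, in contrast
to a threshold reduction, which would raise the sensitivity of the cell to
every input alike.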


The evaluation of a particular memory hypothesis must depend,
at this stage, upon its logical power when employed in specific brain models, 
as well as its physiological plausibility. The mechanisms which are consi- 
dered in this report have been selected for their simplicity and their demons- 
trated ability to yield interesting behavioral results. They suggest plausible 
directions in which to look for a physiological mechanism, but it remains 
possible that the actual mechanisms employed by the brain may be of a drasti- 
cally different sort. It is fundamental to this approach, that any lasting 
change in the system, whatever its physical form, may act functionally as a 
memory trace. It seems likely that there is not a single memory mechanism, 
or even only two memory mechanisms at work in the brain, but rather a 
great number of dynamic processes, ranging from temporary facilitation 
and fatigue effects to permanent structural changes, all of which contribute 
in some way to the observed psychological phenomena called ''memory". 
Among these processes, it is likely that one or two play an outstanding role, 
but likely candidates have not yet been found, and in the meantime, it seems
wise to retain an open mind on the entire question.


3.2.2 Memory Localization 


There is hardly any more agreement on the question of where 
memory traces are to be found (in the gross anatomy of the nervous 
system) than there is on the question of what they consist of. Lashley 
(Ref. 49) was largely responsible for the emphasis on "distributed memory" 
among many theorists over the last few decades, and Sperry (Ref. 95) has
contributed a number of experiments which indicate that the residual
effects of learning must be widely dispersed throughout the brain. On the
other hand, Penfield (Ref. 68) has shown that specific recall may be evoked
by stimulation of specific selected points in the cerebral cortex. E. R. John,
in a model which is supported by a certain amount of experimental evidence
(Ref. 39), proposes that the memory traces are distributed between the
thalamus and cortex, involving reverberating circuits and feedback loops
between these two regions rather than being localized in one or the other of
them.


The question of localization is of less importance for a functional 
model of the brain than is the question of mechanism; as long as we assume 
that it is the network topology, rather than the actual anatomical position of 
neurons, which is important in determining the brain's logical properties, 
there is no reason for requiring that a brain model resemble the biological 
system in its spatial organization. The indirect implications of the different 
theories of localization are of considerable importance, however. For one 
thing, the view that the brain contains its memories in a widely dispersed, 
intermingled form, suggests a mechanism in which the same cells parti-
cipate in a great variety of different, and perhaps totally unrelated, memory
organizations. A model which can separate distinct memories from such a
multiply overwritten system will be quite different in character from one in 
which each remembered event is stored in its own distinct location. For 
another thing, the apparent complexity of memory-sites which may interact 
in the recall of a single experience or association (as emphasized in John's 
work) impresses us with the possibility that human memory may be a 
product of a number of related processes and mechanisms, perhaps 
acting in a complex sequence of cause-and-effect, rather than a simple
correlation of inputs and outputs.


Again, we are stuck with the necessity of simplifying for
lack of detailed knowledge. While it is likely that memory and recall in
the human nervous system involve the coordinated activity of several parts
of a complex structure, we will attempt, at the outset, to see what psycho- 
logical properties can be duplicated by a system in which memory is located 
in a single set of connections, with a minimum of structural differentiation. 
As perceptrons are elaborated into more highly structured models, the
question of which connections should be allowed to participate in memory
processes will be reconsidered, and alternative systems will be investigated.


3.2.3 Isomorphism and the Representation of Structured Information 


Lashley, Köhler, Greene, MacKay, and others (Refs. 28, 45, 50,
55, 56, 110) have dealt with various aspects of the problem of isomorphism
between the representation of an event in the central nervous system and the
physical structure of the event in the outside world. In the naive isomorphism
of Köhler, it is required that the representation in the brain should actually
have a spatial structure resembling the thing that it represents; in the more 
sophisticated form advocated by Greene, it is sufficient that the represen- 
tation should have a logical structure (not necessarily spatial in its physical 
manifestation) which permits it to be broken apart, dissected, and reassembled 
by suitable manipulations or attention-directing processes, in a way which is 
related to the parts, surfaces, or aspects of the real-world phenomenon. 
While some such structural representation seems to be inescapable in
human perception, thinking, and imagery, the exact form that this might
take is again almost totally unknown. This is essentially the problem of
determining the code employed by the brain in its representation of
perceptual phenomena. We know that the code is one which enables us to
recognize parts, relations, symmetries, and other organizational features
which might be lost in a completely arbitrary representational system (such
as a code which assigns binary symbols, in sequence, to all stimuli, and
then lists all of those which are to be considered as "similar"). We also
know that there are parts of the brain (the sensory projection areas) in 
which actual spatial organization of stimulus patterns is retained. We do 
not know, however, how far the representational code must go in the 
direction of spatial isomorphism in order to account for the organizational 
properties of experience. As usual, we shall begin with a simplification 
which assumes an unstructured coding, but it seems likely that this will have 
to be abandoned in order to deal with problems of figural representation, 
perception of relations, and other "gestalt problems". An attempt will be
made in this report, however, to show that the required structuring for
some of these problems may be acquired by adaptive processes and need
not superficially resemble the phenomena which are represented.


3.2.4 Adaptive Processes in Perception 


Much of the theoretical work on brain models (Hebb, Hayek, 
etc.) has been concerned with processes by which complex perceptual 
organizations can be "built up" out of sensory fragments, by a process
of learning or association. Consequently, the question of adaptability,
or modifiability, of perception is of paramount importance as a guide in
model construction. The history of this problem has recently been
reviewed by Hochberg (Ref. 34). Studies of "perceptual learning" have
been concerned (1) with the organization of given perceptual elements
into "concepts", or "kinds of objects", and (2) with the modification of the
perceptual elements or "impressions" themselves.


(1) The first type of experiment is concerned with the discrimi-
nation, rather than the "appearance" of stimuli. It is clear that much
recognition and discrimination, as in the learning of speech sounds in a
new language, is highly dependent upon learning. Such processes typically
involve differentiation, rather than synthesis of complex patterns out of
readily identified parts. Another important part of perceptual concept
formation is concerned with associating, or classifying readily discrimin-
able patterns or symbols having the same significance (such as a Roman,
italic, and script form for the letter "A"). (2) On the other hand, there
are a number of studies concerned with attempts at modifying the seemingly
intrinsic "appearance" of the stimulus itself. Such experiments are not
concerned with refinements in discrimination or assignment of appropriate
names to stimuli; they are concerned with re-structuring the sensory data
at a considerably more "primitive" level. Such experiments include
studies of figural aftereffects (Ref. 25), ambiguous figures (Ref. 107),
the effect of memory upon color perception (Ref. 10), and the various
experiments performed with inverting prisms to determine whether a
human subject could learn to perceive normally with an inverted retinal
field. Work with animals reared in darkness and exposed to the light
for the first time in various test situations has been reported by Riesen
(Ref. 75), and Gibson and Walk (Ref. 24) have conducted experiments with
infants and newborn animals to determine whether depth perception is
possible prior to learning. Other data have been collected by von Senden for
congenitally blind human subjects to whom sight is restored by surgery
(Ref. 106).


In general, the conclusions of this work seem to indicate that
while recognition, in the sense of being able to discriminate and assign an 
appropriate name to an object, is largely dependent upon experience, the 
"subjective appearance" of a stimulus is relatively inflexible, and in some 
species, at least, may be innately given by the structure of the nervous 
system. Sperry's work with frogs, for example, in which the optic nerves 
are cut and then allowed to rejoin with the eyeballs inverted, suggests that 
no amount of relearning can compensate for so drastic a change (Ref. 94),
and the Gibson-Walk experiments support the assumption of a highly
developed sense of depth perception in many mammals from birth. To a
much lesser degree, modification of visual images by experience is 
possible; generally, this takes the form of persistent field interactions 
(as in figural aftereffects) rather than a basic reorganization of perceptual 
experience. The extent to which perception might be organized by adaptive 
processes is currently unknown, and this is one of the main areas in which
theoretical brain models may prove helpful to psychology.


3.2.5 Influence of Motivation on Memory 


In psychological learning theories, it is commonly assumed 
that a "drive" or "motive" must be present in order for an animal to
learn. Conditioned reflex experiments, on the other hand, frequently fail
to show any relationship between the "motivation state" of the animal and
the learning process. Speculation about the role of motivation in perceptual
learning has also been quite extensive, and a number of experiments have
been performed to test the learning of perceptual discriminations or
related tasks on the basis of "mere repetition" as opposed to directed
learning. In these experiments, it is often hard to distinguish between
"attention" and "motivation", and the results are generally inconclusive.
It seems that a certain amount of "incidental learning" does indeed occur, 
which is not directly relevant to the goal or task of the subject at the time; 
the actual degree of motivation, reward or punishment, or "reinforcement"
that may have been involved, however, is impossible to ascertain in any 
absolute way. For the brain model problem, it is important to note that 
there are some learning situations, at least, in which "reward and punish- 
ment" can be used to control the acquisition of new responses; whether or 
not this is universally the case, and the actual physiological mechanisms 
involved, remain open questions at this time. It should be remembered, 
however, that any brain model which relies on the intervention of an outside 
agent or experimenter to direct the learning process is implicitly taking a 
stand on this issue. A possible compromise is found in the approach of 
Ashby (Ref. 3) where the brain is described as a complex homeostatic 
organization, in which particular "crucial variables" are capable of
triggering random changes in organization if they exceed critical limits;
stabilization of behavior, in such a system, is not a result of learning
from reward, but is due to the cessation of disruptive changes which occur
when the system makes a mistake. The main difficulty in making use of
this approach is in guaranteeing that changes are sufficiently specific and
well-directed so that the organism achieves its new behavior pattern in an
economical and relatively direct fashion, rather than going on a random
walk through all possible alternatives before arriving at the required
solution.
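

The difficulty just described can be made concrete with a small sketch of
blind reorganization. This is not Ashby's own formulation; it assumes, purely
for illustration, that the system's organization is a list of two-position
settings, that exactly one configuration keeps the crucial variables within
limits, and that every disturbance triggers a completely random new
configuration. The names `reorganize_until_stable`, `is_within_limits`, and
`average_changes` are hypothetical.

```python
import random


def reorganize_until_stable(is_within_limits, n_switches, rng, max_trials=100_000):
    """Blind trial-and-error search for a stable configuration.

    is_within_limits -- hypothetical test of whether the crucial variables
                        stay inside their critical limits for a configuration
    n_switches       -- number of two-position settings in a configuration
    rng              -- random.Random instance (seeded for repeatability)
    Returns the final configuration and the number of random changes made.
    """
    config = [rng.choice((0, 1)) for _ in range(n_switches)]
    changes = 0
    while not is_within_limits(config) and changes < max_trials:
        # A crucial variable is out of bounds: trigger a random reorganization.
        config = [rng.choice((0, 1)) for _ in range(n_switches)]
        changes += 1
    return config, changes


def average_changes(n_switches, runs=200, seed=0):
    """Mean number of random changes when exactly one configuration is stable."""
    rng = random.Random(seed)
    target = [1] * n_switches
    total = 0
    for _ in range(runs):
        _, changes = reorganize_until_stable(lambda c: c == target, n_switches, rng)
        total += changes
    return total / runs


for n in (2, 4, 8):
    print(n, average_changes(n))
```

Under these assumptions the expected number of random changes is one less
than the number of alternative configurations (2 to the power n), so the
search becomes uneconomical very quickly unless the changes are made more
specific and better directed.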


3.2.6 The Nature of Awareness and Cognitive Systems


While it has been relegated by many theorists to the realm of
philosophy or semantics rather than science, the question of the nature of 
consciousness or awareness keeps recurring in the literature. Current 
physiologists and psychologists represent the whole range of philosophical 
positions on this subject. For Eccles (Ref. 18) there is a conscious
"mind" which controls the body by acting upon the nervous system. For
Penfield and Jasper, awareness is a state of the nervous system involving
heightened sensitivity and improved coordination, under the control of the
centrencephalic system, and particularly the reticular formation (Ref. 38).
John (Ref. 39) suggests that awareness may be a property arising from
the process of "cortico-reticular resonance". For Culbertson (Ref. 17),
consciousness is a property of trees of causal relations which tie together
the events of the external physical world and the neural events in the
brain. Lotka (Ref. 53) has suggested that we look to the world of molecular
events for an explanation, and that consciousness involves particular
unstable states of molecular or atomic particles.


To this writer, it seems likely that the question of the "nature
of awareness" can be bypassed, in much the same way that we bypass the
question of the "nature of perception", by concentrating on the experimental
and psychological criteria which may be used to distinguish the actual
phenomena in question. When a subject reports that he is "conscious" or
that he was recently "unconscious", we are led to believe him or dis-
believe him on the basis of his behavior, and what he is able to report
about the content of his "experience" at the time in question. From an
operational point of view, the "fact of consciousness" is closely connected
with the accessibility of information and its ability to influence overt
behavior; it is, in fact, meaningless to say that an individual is "conscious"
unless there is something that he is conscious of. The questions which can
be asked concerning this phenomenon in a theoretical brain model (where we
are not free to assume any intrinsic similarity of processes to those in the
human brain) are questions of what can be discriminated, "seen", "attended
to", or "remembered" under specified conditions. All that we can say,
in the last analysis, is that the system acts as if it were conscious, leaving
the question of the actual existence of consciousness in the system for
metaphysicists to consider.


Systems which represent information internally, in such a way 
that it can be utilized for the control of certain kinds of responses (such as 
running, thinking, or talking) will be called cognitive with respect to the 
realm of information which is represented and the class of responses which 
this information controls. Note that this term is used in a relative, rather 
than an absolute sense. Thus the representation of information in the form 
of an image on the retina is not sufficient to permit us to say whether or 
not the organism is cognitive with respect to its visual environment; we 
must also demonstrate that this information is accessible to the organism 
for the control of some specified set of responses. We might say, for 
example, that a man who automatically stops for a red light, but is 
unable to state afterwards why he stopped is cognitive with respect to 
red signals at the level of overt motor-responses, but not at the level
of verbal recall. Conversely, an unskilled pianist may be cognitive with
respect to errors in his performance at the verbal level, but not at
the motor control level. We use the term cognitive, then, to indicate
that knowledge of some realm of information is accessible for the control
of some specified class of responses. This usage permits us to reserve
judgement on the definition of such phenomena as perception and awareness,
and still to recognize a class of psychological phenomena involving the
accessibility of information, with which we shall be concerned.


3.3. Experimental Tests of Performance


The purpose of a theoretical brain model is to demonstrate 
how psychological phenomena can arise from a physical system of 
known structure and functional properties. In the preceding sections of 
this chapter, we have reviewed the physiological data which suggest the 
general form of the model, and the psychological data against which its 
performance must be measured. We now turn to a more specific consi-
deration of the psychological tests which might be applied to a brain model
in order to evaluate its performance, and to compare alternative systems
with one another.


3.3.1 Discrimination Experiments

In the simplest type of experiment which can yield psycholo- 
gically significant information about a system, two distinct stimuli are 
presented to the model, which is required to respond differentially to
them. In the general case, it is not necessary to limit this experiment
to two specific stimuli or sensory patterns; two or more classes of
patterns may be employed, each class consisting of "similar" patterns,
such as squares, or triangles, or various sizes and styles of the letter "A".
This experiment may be performed either to look for spontaneous discrimi- 
nation by the system, in the absence of intervention or guidance by the 
experimenter, or to study forced discrimination in which the experimenter 
attempts to teach the system to make the required distinctions. In a
learning experiment, a perceptron is typically exposed to a sequence of 
patterns containing representatives of each type or class which is to be 
distinguished, and the appropriate choice of response is "reinforced" 
according to some rule for memory modification. The perceptron is then 
presented with a test stimulus, and the probability of giving the appropriate 
response for the class of the stimulus is ascertained. Different results will 
be obtained, depending on whether or not the test stimulus is chosen to
correspond identically to one of the patterns which were used in the
training sequence. If the test stimulus is not identical to any of the training
stimuli, the experiment is not testing "pure discrimination", but involves
generalization as well. If the test stimulus activates a set of sensory
elements which are entirely distinct from those which were activated in
previous exposures to stimuli of the same class, the experiment is a test
of "pure generalization". The simplest of perceptrons, which will be
considered initially, have no capability for pure generalization, but can
be shown to perform quite respectably in discrimination experiments,
particularly if the test stimulus is nearly identical to one of the patterns
previously experienced.
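

A forced discrimination experiment of this kind can be sketched in a few
lines. The sketch makes strong simplifying assumptions which are not taken
from this report: sensory points are connected directly to a single response
unit (the association units of an actual perceptron are omitted), the
"rule for memory modification" is a plain error-correction rule, and the
retina size, the stimuli, and the names `respond` and `train` are
hypothetical.

```python
def respond(weights, stimulus):
    """Return +1 or -1 by the sign of the summed weights of the active
    sensory points, or 0 (no decision) if the sum is exactly zero."""
    total = sum(weights[i] for i in stimulus)
    return (total > 0) - (total < 0)


def train(training_sequence, n_points, passes=5):
    """Error-correction training: when the response is wrong, shift the
    weight of every active point one unit toward the desired response."""
    weights = [0] * n_points
    for _ in range(passes):
        for stimulus, desired in training_sequence:
            if respond(weights, stimulus) != desired:
                for i in stimulus:
                    weights[i] += desired
    return weights


# Hypothetical 10-point retina; each stimulus is the set of active points.
training_sequence = [
    ({0, 1, 2}, +1), ({1, 2, 3}, +1),   # class A
    ({5, 6, 7}, -1), ({6, 7, 8}, -1),   # class B
]
weights = train(training_sequence, n_points=10)

print(respond(weights, {0, 1, 2}))   # identical to a training pattern: 1
print(respond(weights, {0, 1, 4}))   # nearly identical to a class A pattern: 1
print(respond(weights, {4, 9}))      # shares no points with any training pattern: 0
```

In this sketch a test stimulus which overlaps a training pattern is
classified correctly, while one which activates only previously unused
sensory points yields no decision at all, which is the sense in which such
a system lacks pure generalization.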


3.3.2 Generalization Experiments


As indicated above, a pure generalization experiment is one 
in which the brain model, or perceptron, is required to transfer a selective 
response from one stimulus (say, a square on the left side of the retina) 
to a "similar" stimulus which activates none of the same sensory points 
(a square on the right side of the retina). Generalization of a weaker sort 
may be demonstrated if we simply require the system to transfer a 
response to members of a class of similar stimuli, which are not necessarily 
disjoint from the one which has been seen (or heard or felt) before. As in 
the case of discrimination experiments, it is possible to study either 
spontaneous generalization, in which the criteria for similarity are not 
supplied by an outside agency or experimenter, or forced generalization, 
in which the experimenter's concept of similarity is "taught" by means of
a suitable training procedure. Some of the most significant problems in 
brain mechanisms concern generalization phenomena, and particularly 
the meaning of "similarity" for a particular kind of system. In common 
with a number of other theorists (e.g., Pitts and McCulloch, Ref. 71), 
this writer will assume that similarity is primarily determined by a 
group of transformations which stimuli may undergo in a particular 
physical environment. In the normal physical environment, for visual 
stimuli, this would include rigid motions, rotations, size changes, 
projective transformations, certain types of distortions or continuous 
deformations, and changes in color or contrast. A number of more 
subtle forms of similarity (as in styles of architecture, gestures and 
mannerisms, etc.) are presumably due to association of events into 
classes at a higher level of organization than we are concerned with at
this point. It should be noted, however, that a perceptron which is taught
to form arbitrary classes of stimuli might be expected to generalize
along completely arbitrary or abstract dimensions, "similarity of style"
being as legitimate a candidate for a basis of classification as "similarity
of shape". In the simple perceptrons, we will find that "pure generalization"
does not occur, although an apparent generalization of responses to stimuli
which share many sensory points with those previously experienced can be
demonstrated. In this report, this weak form of generalization will be
considered under "discrimination phenomena", the term "generalization"
being reserved primarily for cases in which a mechanism for recognizing
actual similarity, rather than a rough approximation to identity, is involved.
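

The distinction between the two forms of generalization depends only on
whether the test stimulus shares sensory points with what has been seen
before, and can be checked mechanically. The following sketch assumes a
retina addressed by (row, column) pairs and represents each stimulus as the
set of points it activates; the helper names `square` and `kind_of_test`, and
the labels they return, are hypothetical.

```python
def square(top, left, side):
    """Set of (row, column) sensory points covered by a filled square."""
    return {(r, c) for r in range(top, top + side)
                   for c in range(left, left + side)}


def kind_of_test(test_stimulus, seen_stimuli):
    """Classify a test stimulus against previously seen stimuli of its class."""
    if test_stimulus in seen_stimuli:
        return "identical"
    if all(test_stimulus.isdisjoint(s) for s in seen_stimuli):
        return "pure generalization"
    return "weak generalization"


seen = [square(top=2, left=0, side=3)]          # square on the left
shifted = square(top=2, left=1, side=3)         # moved one column over
far_right = square(top=2, left=5, side=3)       # square on the right

print(kind_of_test(seen[0], seen))                           # identical
print(kind_of_test(shifted, seen), len(shifted & seen[0]))   # weak generalization 6
print(kind_of_test(far_right, seen))                         # pure generalization
```

The shifted square still shares six of its nine points with the square
already seen, so a correct response to it is only a rough approximation to
identity; the square on the right shares none, and a correct response to it
would require a mechanism for recognizing similarity as such.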


3.3.3 Figure Detection Experiments 


In the experiments considered above, two or more kinds of 
stimuli are always employed, in order to avoid the trivial case in which 
the desired response is automatically evoked by any stimulus that might 
occur. Since it is assumed that at each moment of time exactly one 
stimulus is present, these experiments represent a "forced choice"
situation, in which the brain model is obliged to give one of several
positive identifications in response to whatever it "sees". Such experi-
ments have their counterparts in animal and human experimentation,
and permit the study of an important class of psychological problems, 
involving simply structured situations. An alternative approach, which 
has been less studied to date, is to give the system the task of searching 
for a particular figure in a sensory field which may or may not contain it. 
In this case, the system is asked to discriminate between "figure present"
and "figure absent", and is typically only instructed in the recognition of
one figure at a time. If the figure appears as a solitary object in an
otherwise empty field, the task is a relatively trivial one. If the figure 
appears against a background, or as part of a complex of other patterns, 
the problem takes on a new aspect of complexity. In the most important 
case, this experiment permits us to study figure-ground organizing 
tendencies in a perceptron, by presenting it with embedded, or ambiguous 
figures which can be recognized as representing one thing if the field is 
appropriately structured, and a different thing if the field is structured 
differently. The Gestalt properties of "good figure" are supposed to
determine the preference of a human observer to perceive one or another
of the possible figures in such a field. Detection experiments permit us
to compare the preferences and rules of "good figure" in a perceptron
with those of human subjects, in controlled situations. Perceptrons 
considered to date show little resemblance to human subjects in their 
figure-detection capabilities, and gestalt-organizing tendencies. In Part IV
of this report, some speculations concerning the development of such
properties in more sophisticated perceptrons will be presented.


3.3.4 Quantitative Judgement Experiments 


Another type of experiment with which little work has been 
done to date involves the estimation of quantitative properties of stimuli 
(size, distance, position, etc.) by perceptrons. It will be seen that simple
perceptrons are capable of learning to represent stimuli by a continuously 
variable "analog" type of response. No work has been done to date, however, 
to investigate such questions as the generalization of quantitative judgement
to new stimuli, or the accuracy which can be achieved in specific cases.
For more advanced systems, an important problem which must ultimately
be faced is that of "perceptual constancies": the tendency in human subjects 
to perceive size, color, or other metric properties of a stimulus in terms 
of the ''actual" physical properties of the object rather than its projection 
on the retina. A man, for example, is perceived to be about six feet tall 
regardless of whether his retinal image subtends one degree or fifteen 
degrees, and a dish appears to be circular in form regardless of whether 
its retinal image is a true circle or an elongated ellipse. It has been 
demonstrated in many psychological experiments that such phenomena 
are not based simply on familiarity with the particular objects involved; 
a completely unfamiliar form, seen in normal physical space, is perceived 
correctly, in terms of its "true" physical properties, except under 
exceptional circumstances (cf. Gibson, Ref. 26). 


3.3.5 Sequence Recognition Experiments 


In the above experiments, it has been assumed that the stimuli 
are fixed, temporally invariant patterns. Analogous problems exist, 
involving discrimination, generalization, figure detection, and metric 
estimation for time-varying, or sequential patterns of all sorts. While 
static organization problems reach their greatest degree of complexity 
in the visual modality, temporal organization becomes comparably 
complex in the auditory field. Speech recognition is one particularly 
important case to be investigated. Problems include not only the 
recognition of particular movements, or sequences, but the segmentation 
of movement and sound patterns into figural units, words, or phrases as 
well. The recognition of sequences in rudimentary form is well within the 
capability of suitably organized perceptrons, but the problem of figural 
organization and segmentation presents difficulties which are just as serious 
here as in the case of static pattern perception. 


3.3.6 Relation Recognition Experiments 


In a simple perceptron, patterns are recognized before 
"relations"; indeed, abstract relations, such as "A is above B" or "the 
triangle is inside the circle" are never abstracted as such, but can only 
be acquired by means of a sort of exhaustive rote-learning procedure, in 
which every case in which the relation holds is taught to the perceptron 
individually. At the present time, the main hope for the abstraction of 
relations seems to lie in systems which are capable of executing a 
sequence of observations, according to a predetermined plan, in which 
first one member of the related pair is observed and then the other, the 
relationship between them being determined by the sequence of "experience" 
during the shift of attention from the first to the second. The problem of 
relation recognition is, at the outset, more complex than those previously 
considered, since it requires, by its very nature, the ability to recognize 
and attend selectively to at least two distinct "parts" of a total organization, 
specifying, for example, which part is larger and which smaller, or which 
part is "outside" and which "inside". The hypothesis that relation recognition 
involves a sequence, or program, of observation means that it must 
make use not only of figure organization capabilities (to separate the 
"parts" referred to) but of sequence recognition and sequential control 
capabilities as well. The actual experiments by which relation recognition 
can be detected must involve at least two components (such as square and 
triangle) which can be shown in such a way as to exemplify the relationship 
or not. In an ideal experiment, the system would be trained to recognize 
the relation by a number of examples with stimulus patterns or "parts" 
which do not resemble or intersect (in their retinal location) the test 
patterns which are employed in evaluating the performance. If the perceptron 
can then indicate correctly, for entirely new stimuli, whether or not the 
relation holds, it will be considered that the relation has been abstracted 
by the system. 


3.3.7 Program-Learning Experiments 


The learning of sequences of behavior is the counterpart on the 
response side of the problem of sequence recognition. The problem has 
been discussed in detail by Lashley (Ref. 50). It requires, as a starting 
point, the ability to form "selective sets", which introduce a bias to give 
one of several alternative responses to a given stimulus. A capability of 
this sort has been shown to exist, to some degree, in relatively simple 
perceptrons, provided there is a feedback path from the response units to 
the association system (Ref. 79). To date, little has been done to study this 
capability in a quantitative fashion, but some of the heuristic arguments will 
be reviewed in Chapter 23. One of the most important applications of such 
a capability is in the control of the sequential activity involved in recognition 
of relations, and the "perceptual exploration" of a sensory field. Related 
phenomena, in which this capability plays a central part, are the sequential 
control of speech, thinking, and complex behavior patterns. The representation 
of problem solving activity in the human by heuristic programs has 
been studied by Newell, Shaw, and Simon (Refs. 62, 63), and it seems 
likely that many of their results might be transferred to a perceptron 
which is capable of program controlled activity. 


3.3.8 Selective Recall Experiments 


While most of the experiments described above involve "memory" 
in the sense of a change in behavior as a consequence of experience, they do 
not, in general, require substantive recall, of the sort which is displayed 
when we describe a person who we saw yesterday, or the location of furni- 
ture in a house where we lived last year. In selective recall experiments, 
the system is required to produce on demand information relevant to a 
particular time, place, or subject. This involves a particular case of 
"selective set" mechanisms, and can probably be demonstrated in most 
systems which are capable of program-controlled behavior. 


3.3.9 Other Types of Experiments 


In addition to the experiments considered above, we might 
ultimately wish to consider experiments in abstract concept formation, 
the formation and properties of a ''self concept", creative imagery, and 
other higher-order psychological phenomena. At the present time, these 
problems seem sufficiently remote from the capabilities of present 
perceptrons that we need not consider them further here. Also relegated 
to the future is the consideration of such psychological phenomena as 
perceptual illusions, figural aftereffects, and related phenomena, even 
though these have been considered primary in some of the brain models 
hitherto advanced. It is this writer's belief that these phenomena are so 
likely to depend on inessential details of brain organization, at almost any 
level of complexity, that it would be a mistake to try to rest the case for 
or against a particular model on a demonstration that it can duplicate a 
particular kind of perceptual illusion. It seems more important, at this 
stage, to account for "veridical perception" than for its occasional failures, 
particularly since these are currently demonstrable in a single species only, 
and may lack any generality whatsoever. 


3.3.10 Application of Experimental Designs to Perceptrons 


The designs considered above have been discussed as if they 
were actual "flesh and blood" experiments, performed with real physical 
systems. In the study of perceptrons, it is not always practical or necessary 
to carry out such experiments in reality; the important thing is that an analysis 
of a given model should always be carried out in terms of an experimental 
design which is specified in sufficient detail so that it could be carried out 
if the system were actually constructed. 


In practice, three main methods are employed in the study of perceptrons: 


(1) Mathematical analysis, in which a stimulus environment, 
the rules for stimulus presentation, and the rules for the modification of the 
perceptron's memory state are clearly specified. The object of such analysis 
is, in general, to determine the probability of correct performance, or the 
probability of achieving a given performance criterion, for a specified class 
of systems. 


(2) Digital simulation, in which the perceptron, its environment, 
and the memory modification rules are all represented in a digital computer 
program, which carries out the required operations of an experiment in 
step-by-step fashion, calculating the response of every neuron and connection 
in the perceptron, and measures the performance of the system. Such a 
program, repeated for a sufficient sample of perceptrons in a class, yields 
much the same type of information as is obtained from a mathematical 
analysis. It has the advantage of being free from all approximations (which 
may be necessary in some analyses) but is less likely to yield important 
insights into the lawful relations which characterize a class of systems. 
Simulation programs are most valuable as an exploratory device, and for 
the study of systems of such complexity that an exact mathematical analysis 
is impossible. 


(3) Study of physical models, involving the actual construction 
of a hardware device, and the performance of the indicated experiments. At 
present, little is to be gained from the study of actual physical models which 
cannot be learned from the other two methods, but as successive models grow 
in size and complexity, and as means are found for the inexpensive construction 
of electronic models, this method becomes increasingly important. Its main 
virtue is the flexibility and adaptability of a hardware perceptron to new types 
of learning experiments and procedures, and the ability to use ordinary 
physical objects and environments as stimuli, which would otherwise involve 
a great deal of time and expense in computer programming. The physical 
model itself, however, is apt to be less flexible than a simulated system, 
and is best suited for "case studies" of a single representative system, 
rather than statistical studies of a class of systems. 


In most of the experiments considered in this report (which 
are listed in Appendix D), human performance capabilities are sufficiently 
well known to permit us to draw conclusions about possible comparisons 
between perceptrons and biological systems without further study. In 
some of the proposed experiments, however (e.g., the figure organization 
experiments described in 3.3.3), additional data may be required on human 
performance in order to obtain a base-line for the quantitative evaluation of 
perceptrons. Thus it seems likely that in the near future, a program in 
experimental psychology with human and animal subjects may be a necessary 
adjunct to the evaluation of our brain models. When this occurs, the models 
are, in effect, being used as predictive devices, capable of generating data 
(probably grossly inaccurate at the outset) which have not yet been actually 
observed in human subjects. The ultimate test for a brain model, from the 
standpoint of psychological validity, is an experiment of this type, in which 
the model correctly predicts phenomena which have yet to be discovered in 
biological systems. 


4. BASIC DEFINITIONS AND CONCEPTS 


This chapter is devoted to basic definitions of terms which will 
be used throughout the report. It is recommended that the reader familiarize 
himself with this terminology in a general way, on first reading, and refer 
back to this chapter when the terms are reintroduced in the subsequent text. 
A list of standard symbols will also be found in Appendix A. 


4.1 Signals and Signal Transmission Networks 


The following definitions, which are not specific to perceptrons, 
are likely to be helpful: 


DEFINITION 1: A signal may be any measurable variable, such as a 
voltage, current, light intensity, or chemical concentration. 
A signal is typically characterized by its amplitude, time, 
and location. 


DEFINITION 2: A signal generating unit is any physical element, or device, 
capable of emitting a signal. The output signal of the unit 
$u_i$ will be represented by the symbol $u_i^*$. 


DEFINITION 3: A signal generating function is any function which defines 
the amplitude of the signal emitted by a signal generating 
unit. 


DEFINITION 4: A connection is any channel (e.g., a wire or nerve fiber) 
by which a signal emitted by one signal generating unit 
(the origin) may be transmitted to another (the terminus). 
A connection $c_{ij}$ is characterized by its origin and 
terminal units ($u_i$ and $u_j$, respectively), and by a 
transmission function* which determines the amplitude 
of the signal induced at the terminus as a function of the 
amplitude and time of the signal generated by the origin 
unit. This signal will be symbolized by $c_{ij}^*(t)$. 


DEFINITION 5: A signal transmission network is a system of signal generating 
units, linked by connections. 


4.2 Elementary Units, Signals, and States in a Perceptron 


A perceptron (which will be defined in the next section) is a 
signal transmission network containing three types of signal generating 
units: sensory units, association units, and response units. These units 
all have signal generating functions which depend on signals originating 
elsewhere in the network, or else externally, in an outside environment. 
The signals upon which the generating function of a unit depends are called 
the input signals to that unit. These units are defined here in a sufficiently 
general manner as to include biological neurons as a special case. We shall 
be chiefly concerned, however, with models which employ simplified versions 
of such neurons. 


* (Footnote to Definition 4.) In previous reports, the term "transfer 
function" has been used for this characteristic. Since "transfer function" 
has a somewhat different meaning in control system theory and elsewhere, it 
is avoided here, and the term "transmission function" is preferred. 


DEFINITION 6: A sensory unit (S-unit) is any transducer responding to 
physical energy (e.g., light, sound, pressure, heat, 
radio signals, etc.) by emitting a signal which is some 
function of the input energy. The input signal at time $t$ 
to an S-unit $s_i$ from the environment, $W$, is symbolized 
by $c_{wi}^*(t)$. The signal which is generated by $s_i$ at time 
$t$ is symbolized by $s_i^*(t)$. 


DEFINITION 7: A simple S-unit is an S-unit which generates an output 
signal $s_i^* = +1$ if its input signal, $c_{wi}^*$, exceeds a 
given threshold, $\theta_i$, and 0 otherwise. 


DEFINITION 8: An association unit (A-unit) is a signal generating unit 
(typically a logical decision element) having input and 
output connections. An A-unit $a_j$ responds to the 
sequence of previous signals $c_{ij}^*$ received by way of 
input connections $c_{ij}$, by emitting a signal $a_j^*(t)$. 


DEFINITION 9: A simple A-unit is a logical decision element, which 
generates an output signal if the algebraic sum of its 
input signals, $\alpha_i$, is equal to or greater than a threshold 
quantity, $\theta > 0$. The output signal $a_i^*$ is equal to $+1$ 
if $\alpha_i \geq \theta$ and 0 otherwise. If $a_i^* = +1$, the unit 
is said to be active. 


DEFINITION 10: A response unit (R-unit) is a signal generating unit 
having input connections, and emitting a signal which is 
transmitted outside the network (i.e., to the environment, 
or external system). The emitted signal from unit $r_i$ 
will be symbolized by $r_i^*$. 


DEFINITION 11: A simple R-unit is an R-unit which emits the output 
$r^* = +1$ if the sum of its input signals is strictly 
positive, and $r^* = -1$ if the sum of its input signals 
is strictly negative. If the sum of the inputs is zero, 
the output can be considered to be equal to zero or 
indeterminate. (A physical unit which oscillates in 
response to a zero signal would have the required 
properties.) 
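
The three simple unit types of Definitions 7, 9, and 11 can be sketched as 
threshold functions. The Python below is an illustration only: the function 
names are hypothetical, and the zero-sum case of the simple R-unit, which the 
definition leaves indeterminate, is represented here as 0 by assumption. 

```python
def simple_s_unit(input_signal, threshold):
    """Definition 7: output +1 if the input exceeds the threshold, else 0."""
    return 1 if input_signal > threshold else 0


def simple_a_unit(input_signals, threshold):
    """Definition 9: output +1 if the algebraic sum of inputs >= threshold."""
    alpha = sum(input_signals)
    return 1 if alpha >= threshold else 0


def simple_r_unit(input_signals):
    """Definition 11: +1 for a strictly positive sum, -1 for a strictly
    negative sum. A zero sum is indeterminate; 0 is returned by assumption."""
    total = sum(input_signals)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0
```

Note that the S-unit comparison is strict while the A-unit comparison is not, 
as the definitions are worded. 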


DEFINITION 12: Transmission functions of connections in a perceptron 
depend on two parameters: the transmission time of the 
connection, $\tau_{ij}$, and the coupling coefficient or value 
of the connection, $v_{ij}$. The transmission function of 
a connection $c_{ij}$ from $u_i$ to $u_j$ is of the form: 
$c_{ij}^*(t) = f[v_{ij}(t), u_i^*(t - \tau_{ij})]$. Values may be 
fixed or variable (depending on time). In the latter 
case, the value is a memory function. 


DEFINITION 13: The activity state of the network at time $t$ is defined 
by the set of signals, $u_i^*$, emitted by all signal 
generating units at time $t$. 


DEFINITION 14: The memory state of a network is the configuration of 
values associated with all (variable valued) connections 
at a specified time. 


DEFINITION 15: The phase space of a network is the space of all possible 
memory states, for a given network. In general, if there 
are N variable-valued connections in the network, the phase 
space may be represented by a region in Euclidean N-space, 
each coordinate corresponding to the value of one connection. 
The memory state of the system at any specified time can 
be characterized by a point in this phase space, and the 
history of the system by a directed line, or path, followed 
by this point. 


DEFINITION 16: The interaction matrix for a network of S, A, and R units 
is the matrix of coupling coefficients, $v_{ij}$, for all pairs 
of units, $u_i$ and $u_j$. If there is no connection from 
$u_i$ to $u_j$, $v_{ij}$ is defined as zero. Specifying an 
interaction matrix is equivalent to specifying a point in 
the phase space. 
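
The relation between the interaction matrix and a point in the phase space 
can be made concrete with a small sketch. The layout below (a nested list 
indexed as `v[i][j]`, and a separate list naming the variable-valued 
connections) is an assumption made for illustration and is not a 
representation prescribed by the text. 

```python
def memory_state(v, variable_connections):
    """Return the phase-space point for interaction matrix v: the values of
    the variable-valued connections, taken in a fixed order."""
    return [v[i][j] for (i, j) in variable_connections]


# Three units u0, u1, u2; v[i][j] is the value of the connection u_i -> u_j,
# and is zero wherever no connection exists (Definition 16).
v = [
    [0.0, 1.0, 0.0],  # u0 -> u1, taken here to be fixed
    [0.0, 0.0, 0.5],  # u1 -> u2, taken here to be variable
    [0.0, 0.0, 0.0],  # no connections originate at u2
]
variable = [(1, 2)]

# With N = 1 variable-valued connection the phase space is one-dimensional.
point = memory_state(v, variable)
```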


4.3 Definition and Classification of Perceptrons 


DEFINITION 17: A perceptron is a network of S, A, and R units with a 
variable interaction matrix $V$ which depends on the 
sequence of past activity states of the network. 


DEFINITION 18: The logical distance from unit $u_i$ to $u_j$ is equal to the 
number of connections in the shortest path by which a 
signal can be transmitted from $u_i$ to $u_j$. 


DEFINITION 19: A series-coupled perceptron is a system in which all 
connections originating from units at logical distance $d$ 
from the closest S-unit terminate on units at logical 
distance $d+1$ from the closest S-unit. 


DEFINITION 20: A cross-coupled perceptron is a system in which some 
connections join units of the same type (S, A, or R) 
which are at the same logical distance from S-units, 
all other connections being of the series-coupled type. 


DEFINITION 21: A back-coupled perceptron is a system in which at least 
one A or R unit at a distance $d_1$ from the closest 
S-unit is the origin of a connection back to an S-unit 
or to an A-unit at a distance $d_2 < d_1$ from the closest 
S-unit; i.e., this is a system with feedback paths from 
units located near the output end of the system to units 
closer to the sensory end. 
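
Definitions 18 through 21 lend themselves to a short computational sketch: 
logical distance from the closest S-unit can be found by breadth-first 
search, and each connection can then be labelled by the change in distance 
across it. The function names and the representation of connections as 
(origin, terminus) pairs are assumptions of this illustration. It also 
simplifies in two ways: every unit is assumed reachable from some S-unit, and 
a network with both cross and back connections is simply reported as 
back-coupled, without checking unit types. 

```python
from collections import deque


def distance_from_s_units(connections, s_units):
    """Logical distance of each unit from the closest S-unit, found by
    breadth-first search over directed (origin, terminus) connections."""
    successors = {}
    for origin, terminus in connections:
        successors.setdefault(origin, []).append(terminus)
    dist = {s: 0 for s in s_units}
    queue = deque(s_units)
    while queue:
        u = queue.popleft()
        for w in successors.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def classify_coupling(connections, s_units):
    """Label a network as series-, cross-, or back-coupled."""
    dist = distance_from_s_units(connections, s_units)
    has_cross = False
    has_back = False
    for origin, terminus in connections:
        step = dist[terminus] - dist[origin]
        if step == 0:
            has_cross = True   # joins units at the same logical distance
        elif step < 0:
            has_back = True    # feeds back toward the sensory end
    if has_back:
        return "back-coupled"
    if has_cross:
        return "cross-coupled"
    return "series-coupled"
```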

It should be noted that the above definitions are not exhaustive; 
they are intended to designate certain generic classes of perceptrons with 
which we shall be concerned. The initial models to be considered are of the 
type specified by the following definitions: 


DEFINITION 22: A simple perceptron is any perceptron satisfying the 
following five conditions: 

1. There is only one R-unit, with a connection 
from every A-unit. 

2. The perceptron is series-coupled, with connections 
only from S-units to A-units, and from A-units 
to the R-unit. 

3. The values of all sensory to A-unit connections 
are fixed (do not change with time). 

4. The transmission time of every connection is 
either zero or equal to a fixed constant, $\tau$. 

5. All signal generating functions of S, A, and R 
units are of the form $u_i^*(t) = f(\alpha_i(t))$, where 
$\alpha_i(t)$ is the algebraic sum of all input signals 
arriving simultaneously at the unit $u_i$. 


DEFINITION 23: An elementary perceptron is a simple perceptron with 
simple R- and A-units, and with transmission functions 
of the form $c_{ij}^*(t) = u_i^*(t - \tau)\, v_{ij}(t)$. 
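
As a rough illustration of Definitions 22 and 23, the response of an 
elementary perceptron to a single stimulus can be computed in two stages. 
The sketch assumes instantaneous transmission (a transmission time of zero), 
so that the signal on each connection is simply the origin unit's output 
multiplied by the connection value. The function and argument names are 
hypothetical, and a zero sum at the R-unit is reported as 0. 

```python
def elementary_response(stimulus, s_to_a, a_thresholds, a_to_r):
    """Return (A-unit outputs, r*) for one stimulus.

    stimulus     : list of S-unit output signals (0 or 1)
    s_to_a       : s_to_a[i][j] = fixed value of the connection S_i -> A_j
    a_thresholds : threshold (> 0) of each simple A-unit
    a_to_r       : variable value of the connection A_j -> R
    """
    n_s = len(stimulus)
    n_a = len(a_thresholds)
    a_out = []
    for j in range(n_a):
        # Algebraic sum of the signals arriving at A-unit j.
        alpha_j = sum(stimulus[i] * s_to_a[i][j] for i in range(n_s))
        a_out.append(1 if alpha_j >= a_thresholds[j] else 0)
    # Sum of the signals arriving at the single R-unit.
    total = sum(a_out[j] * a_to_r[j] for j in range(n_a))
    r_out = 1 if total > 0 else (-1 if total < 0 else 0)
    return a_out, r_out
```

Only `a_to_r` would be altered by reinforcement in such a system, since 
condition 3 of Definition 22 fixes the S to A values. 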


Perceptrons can be represented graphically in several different 
ways. In particular, frequent use is made of three types of diagrams, which 
will be called network diagrams, set diagrams, and symbolic diagrams. 
Depending upon the level of specificity required, any one of these diagrams 
may be used to represent the same system. The three types of diagrams 
are illustrated in Figure 2. The network diagram shows each connection 
and signal unit individually; the arrows indicate the direction of signal 
transmission through the connections. The set diagram represents all 
S-units as a single set, connected to the set of A-units (or association 
system) which is represented by a Venn diagram, the subsets of which 
are connected to different R-units. Set diagrams of this general type are 
found to be particularly useful as an aid to analysis. The symbolic diagram 
for this same perceptron merely indicates the kinds of connections which 
exist, namely, S to A, A to R, and S to S. The perceptron illustrated 
would be called a three-layer perceptron, cross-coupled at the sensory 
layer. 


Figure 2 PERCEPTRON DIAGRAMS (network diagram, set diagram, and symbolic 
diagram of the same system) 


4.4 Stimuli and Environments 


DEFINITION 24: A stimulus is any non-zero set of input signals, $c_{wi}^*(t)$, 
to the S-units at time $t$. If there are $N_s$ sensory 
units in the retina, then a stimulus can be characterized by 
a vector of $N_s$ elements, representing the signal to each 
S-unit as an element of the vector. The condition in 
which all input signals are equal to zero is not considered 
a stimulus unless otherwise specified. 


DEFINITION 25: A stimulus world (or environment) is any set of stimuli, 
defined for a specified S-unit set. The stimulus world 
will be symbolized by $W$. The number of different stimuli 
will usually be denoted by $n$. 


DEFINITION 26: A stimulus-sequence world (or stimulus-sequence 
environment) is any set of stimulus sequences, each 
consisting of an ordered series of stimuli from the set $W$. 
(For example, if the image of a printed word is a stimulus, 
and $W$ consists of all words in a dictionary, then the 
set of all English sentences would comprise a 
stimulus-sequence world.) 


4.5 Response Functions and Solutions 


DEFINITION 27: A response function is any assignment of R-unit output 
signals to stimuli in $W$. For a simple perceptron, the 
response function $R(W)$ is a vector of $n$ elements, 
$(R_1, R_2, \ldots, R_n)$, indicating the value of the 
response for each of the stimuli, $S_1, S_2, \ldots, S_n$, in 
the environment. 


DEFINITION 28: A classification is an equivalence class of response 
functions. Two response functions are considered 
equivalent if their corresponding elements agree in 
sign. For any perceptron with one simple R-unit, a 
classification, $C(W)$, divides $W$ into two classes: 
a positive class consisting of all stimuli for which $r^* = +1$, 
and a negative class, consisting of those stimuli for which 
$r^* = -1$. 


DEFINITION 29: A response-sequence function is an assignment of sequences 
of R-unit output signals to stimulus sequences in a 
stimulus-sequence world. This is a generalization of the 
concept of a response function to include a time dimension. 


DEFINITION 30: A solution to a response function (or classification) is said 
to exist for a given perceptron if there is a point in the 
phase space of the perceptron such that the response $R_i$ 
(specified by the function) will occur if the stimulus $S_i$ 
is shown, for all $S_i$ in $W$. 
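
For a classification, Definition 30 amounts to a check that can be written 
down directly: a candidate point in the phase space is a solution if the sign 
of the R-unit's input sum matches the required sign for every stimulus in the 
world. The sketch below assumes a simple perceptron whose A-unit activity for 
each stimulus has already been tabulated; the names are hypothetical. 

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_solution(a_activity, values, required):
    """True if the memory state `values` yields the required response sign
    for every stimulus.

    a_activity[k][j] : output (0 or 1) of A-unit j for stimulus S_k
    values[j]        : value of the connection from A-unit j to the R-unit
    required[k]      : required response R_k (+1 or -1) for stimulus S_k
    """
    for activity, r_required in zip(a_activity, required):
        total = sum(a * v for a, v in zip(activity, values))
        if sign(total) != sign(r_required):
            return False
    return True
```

Because only signs are compared, any two response functions in the same 
equivalence class (Definition 28) are satisfied by the same set of points. 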


4.6 Reinforcement Systems 


DEFINITION 31: A reinforcement system is any set of rules by which 
the interaction matrix (or memory state) of a perceptron 
may be altered through time. 


DEFINITION 32: A reinforcement control system is any system or 
mechanism external to a perceptron which is capable 
of altering the interaction matrix of the perceptron in 
accordance with the rules of a specified reinforcement 
system. 


DEFINITION 33: Positive reinforcement is a reinforcement process in 
which a connection from an active unit $u_i$ which 
terminates on a unit $u_j$ has its value changed by a 
quantity $\Delta v_{ij}(t)$ (or at a rate $dv_{ij}/dt$) which 
agrees in sign with the signal $u_j^*(t)$. 


DEFINITION 34: Negative reinforcement is a reinforcement process in 
which a connection from an active unit $u_i$ which 
terminates on a unit $u_j$ has its value changed by a 
quantity $\Delta v_{ij}(t)$ (or at a rate $dv_{ij}/dt$) which 
is opposite in sign from $u_j^*(t)$. 


DEFINITION 35: A monopolar reinforcement system is a reinforcement 
system in which the values of all connections terminating 
on a unit $u_j$ remain unchanged at time $t$ unless $u_j^*(t)$ 
is strictly positive. 


DEFINITION 36: A bipolar reinforcement system is a reinforcement 
system in which the values of connections are subject 
to change regardless of whether the output of the 
terminal unit is positive or negative. 


DEFINITION 37: Alpha system reinforcement is a reinforcement system 
in which all active connections $c_{ij}$ which terminate on 
some unit $u_j$ (i.e., connections for which $u_i^*(t-\tau) \neq 0$) 
are changed by an equal quantity $\Delta v_{ij}(t) = \eta$ or 
at a constant rate while reinforcement is applied, and 
inactive connections ($u_i^*(t-\tau) = 0$) are unchanged at 
time $t$. A perceptron in which $\alpha$-system reinforcement 
is employed will be called an $\alpha$-perceptron. The 
reinforcement will be called quantized if the change is a 
fixed quantity (constant $|\eta|$) or non-quantized if the value may 
change by an arbitrary magnitude. 
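
The alpha system rule of Definition 37 is simple enough to state in a few 
lines. The following sketch applies one quantized reinforcement to the input 
connections of a single unit; the function name and list-based representation 
are assumptions made for illustration. 

```python
def alpha_reinforce(values, origin_outputs, eta):
    """Return the new values of the connections terminating on one unit.

    values[i]         : current value of the connection from unit i
    origin_outputs[i] : output of origin unit i (nonzero means active)
    eta               : reinforcement quantity
    Active connections change by eta; inactive connections are unchanged.
    """
    return [v + eta if u != 0 else v for v, u in zip(values, origin_outputs)]
```

Unlike the conservative rule described next in the text, nothing here 
constrains the total of the values, which grows or shrinks by `eta` times 
the number of active connections. 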


DEFINITION 38: Gamma system reinforcement is a rule for changing the 
values of the input connections to some unit, whereby all 
active connections are first changed by an equal quantity, 
and the total quantity added to the values of the active 
connections is then subtracted from the entire set of 
input connections, being divided equally among them. 
Such a system is said to be conservative in the values, 
since the total of all values can neither increase nor 
decrease. The change in $v_{ij}$ is equal to 

$$\Delta v_{ij}(t) = \left( \omega_{ij}(t) - \frac{\sum_i \omega_{ij}(t)}{N_j} \right) \eta$$ 

where $\omega_{ij}(t) = 1$ if $u_i^*(t-\tau) \neq 0$, 0 otherwise; 
$N_j$ = number of connections terminating on $u_j$; 
$\eta$ = reinforcement quantity (typically $\pm 1$ or 0). 
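
The conservative property of the gamma system can be checked numerically. 
The sketch below applies the rule of Definition 38 to the input connections 
of one unit; as before, the function name and data layout are illustrative 
assumptions. 

```python
def gamma_reinforce(values, origin_outputs, eta):
    """Return the new values of the connections terminating on one unit.

    Each active connection gains eta, and the total amount added is then
    subtracted in equal shares from all input connections, so the sum of
    the values is unchanged.
    """
    omega = [1 if u != 0 else 0 for u in origin_outputs]
    n_connections = len(values)
    mean_omega = sum(omega) / n_connections
    return [v + (w - mean_omega) * eta for v, w in zip(values, omega)]
```

With four connections of which two are active and a reinforcement quantity 
of 1, the active values rise by 0.5 and the inactive values fall by 0.5. 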
Additional reinforcement rules, and variations of the above, 
will be presented as required. The above terminology has been standardized 
in previous work on perceptrons, and represents the systems on which most 
analysis has been done. In most of the cases to be considered, the 
reinforcement control system employs one of three training procedures, 
defined as follows: 


DEFINITION 39: A response-controlled reinforcement system (R-controlled 
system) is a training procedure in which the magnitude of 
$\eta$ is constant, and the sign of $\eta$ is entirely determined 
by the current response, $r^*$, regardless of the 
current stimulus, $S$. In general, unless otherwise 
specified, this term implies that the reinforcement is 
always positive (i.e., the sign of $\eta$ agrees with the 
sign of $r^*$, in a simple perceptron). 


DEFINITION 40: A stimulus-controlled reinforcement system (S-controlled 
system) is a training procedure in which the magnitude of 
$\eta$ is constant, and the sign of $\eta$ is determined 
entirely by the current stimulus, $S$, and a predetermined 
classification, $C(W)$; the current response 
of the perceptron does not influence either the sign or 
magnitude of $\eta$. 


DEFINITION 41: An error-corrective reinforcement system (error 
correction system) is a training procedure in which 
the magnitude of $\eta$ is 0 unless the current response 
of the perceptron is wrong, in which case, the sign of 
$\eta$ is determined by the sign of the error. In this 
system, reinforcement is 0 for a correct response, 
and negative (see Definition 34) for an incorrect response, 
or, more generally, $\eta = f(R^* - r^*)$, where $R^*$ is the 
required response, $r^*$ is the obtained response, and $f$ 
is a sign-preserving monotonic function, such that 
$f(0) = 0$. 
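
The three training procedures differ only in what information determines the 
reinforcement quantity. The following sketch makes that difference explicit; 
the function names are hypothetical, responses are taken to be +1 or -1, and 
the identity function is used as one admissible choice of the sign-preserving 
monotonic function in Definition 41. 

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def eta_r_controlled(response, magnitude=1):
    """Definition 39: the sign of eta follows the current response only."""
    return magnitude * sign(response)


def eta_s_controlled(required, magnitude=1):
    """Definition 40: the sign of eta follows the predetermined
    classification of the current stimulus; the response is ignored."""
    return magnitude * sign(required)


def eta_error_corrective(required, response, f=lambda error: error):
    """Definition 41: eta = f(R* - r*), which is 0 for a correct response."""
    return f(required - response)
```

In the error-corrective case a wrong response of -1 where +1 was required 
gives a positive quantity, opposite in sign to the response actually emitted, 
which is what Definition 34 calls negative reinforcement. 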


In previous reports (Refs. 41, 82) the R-controlled system 
has been referred to as a "spontaneous learning system", since the 
perceptron evolves in an autonomous fashion, uninfluenced by the 
"correctness" of its outputs. The reinforcement control system requires no 
information from the environment in order to control the changes in the 
memory state of the perceptron. The S-controlled system has also been 
referred to as a "forced learning system", since the reinforcement control 
system (r.c.s.) imposes a predetermined classification on the perceptron's 
responses, without taking the actual responses of the system into account 
at any time. 


4.7 Experimental Systems 


DEFINITION 42: An experimental system is a system consisting of a 
perceptron, a stimulus world, $W$, and a reinforcement 
control system. The reinforcement control 
system may be an automatic regulating device (e.g., 
a thermostat) or a human operator, capable of responding 
to the responses of the perceptron and the stimuli in 
the environment by applying the appropriate reinforcement 
rules, altering the memory state of the perceptron. 


Figure 4 GENERAL EXPERIMENTAL SYSTEM 


The basic organization of an experimental system with a simple 
perceptron is shown in Figure 3. A more general system, in which the 
perceptron may be of any variety, and where the output of the perceptron 
is capable of modifying its stimulus environment, is illustrated in Figure 4. 
A comparison with Figure 1 should indicate the basic similarity between the 
perceptron, in a general experimental system, and the biological nervous 
system. Analyses of perceptron performance always postulate an experimental 
system, involving, as a minimum, the components shown in Figure 3. 
The reinforcement control system (r.c.s.) can be considered a specialized 
part of the environment, in its relation to the perceptron, although it might 
actually be built into the same physical mechanism as the perceptron itself. 
In an R-controlled system, the information channel shown from W to the r.c.s. 
is non-functional, while in an S-controlled system the information channel 
from the R-unit to the r.c.s. is non-functional, and in an error-correction 
system, both channels are essential for reinforcement control. In digital 
simulation programs, the r.c.s. is the part of the program concerned with 
reinforcing the simulated perceptron, while in experiments with hardware 
systems it is generally a human operator. 
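
In a digital simulation, the division of labor among stimulus world, 
perceptron, and r.c.s. might look roughly as follows. This is a minimal 
sketch under stated assumptions rather than a description of any program 
used in the work reported here: the A-unit activity for each stimulus is 
taken as given, only the A to R values are variable, responses are +1 or -1, 
reinforcement is error-corrective with alpha system updates, and testing is 
done on the training stimuli themselves. All names are hypothetical. 

```python
def run_experiment(a_activity, required, eta=1, max_passes=100):
    """Train A -> R values by error correction, then test.

    a_activity[k][j] : output (0 or 1) of A-unit j for stimulus S_k
    required[k]      : required response (+1 or -1) for stimulus S_k
    Returns (values, fraction_correct).
    """
    values = [0.0] * len(a_activity[0])

    def response(activity):
        total = sum(a * v for a, v in zip(activity, values))
        return (total > 0) - (total < 0)

    # Training procedure: the r.c.s. compares the obtained response with the
    # required one and reinforces active connections only on an error.
    for _ in range(max_passes):
        errors = 0
        for activity, target in zip(a_activity, required):
            if response(activity) != target:
                errors += 1
                for j, a in enumerate(activity):
                    if a != 0:
                        values[j] += eta * target
        if errors == 0:
            break

    # Testing procedure: measure performance over the stimulus world.
    correct = sum(response(act) == t for act, t in zip(a_activity, required))
    return values, correct / len(required)
```

The `max_passes` bound is a practical safeguard added for the sketch; 
whether and when such a procedure terminates is a question taken up in the 
analysis of later chapters. 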


An experiment involves an experimental system, a training 
procedure, and a procedure for testing the perceptron, or measuring its 
performance. A number of typical psychological experiments, which are 
of interest for perceptrons, were outlined in Chapter 3, and some of 
these will be analyzed in the following chapters. 


PART II 


THREE-LAYER SERIES-COUPLED PERCEPTRONS 


5. THE EXISTENCE AND ATTAINABILITY OF SOLUTIONS IN 
ELEMENTARY PERCEPTRONS 


The perceptrons to be considered in Part II all consist of 
three layers of units connected in series, with the topology S → A → R.
In the following chapters, it will be seen that these perceptrons are 
capable of learning any set of responses which we might care to have them 
make to a universe of stimuli. Their main deficiencies are a lack of ability 
to generalize their performance to new stimuli or new situations where they 
have not been explicitly taught and a lack of ability to analyze complex 


environmental situations into simpler parts. 


The first perceptron model to be considered in detail is the 
elementary α-perceptron. In this chapter, we shall examine the intrinsic
ability of such systems to realize solutions to classification problems, 
including several theorems concerning the relationship of the size of the 
system to the existence of solutions, and the possibility of attaining such 
solutions by different training procedures. The term "solution" is used in
the sense of Def. 30, in Chapter 4. Most of these results were first presented 
in Ref. 86. 


5.1 Description of Elementary α-Perceptrons


Elementary α-perceptrons were defined in Chapter 4, as a
subclass of simple perceptrons, in which S-units send connections to
A-units, and the A-units all send connections to a single R-unit, no
other connections being permitted, and all connections having equal
transmission times, $\tau$. Without loss of generality, $\tau$ can be taken to be
zero, and this assumption of instantaneous transmission will be made
whenever we deal with simple perceptrons, unless otherwise stated. The
A-units and R-unit in all elementary perceptrons are of the simple type,
i.e., they have a threshold, $\theta$ (equal to zero in the case of the R-unit),
and emit a signal only if the input signal, $\alpha$, is equal to or greater than $\theta$.
The connections from S-units to A-units have fixed values, and the connections
from the A-units to the R-unit have variable values, which depend on the
history of reinforcements applied to the perceptron. The connections, in an
elementary perceptron, all have the transfer function (assuming $\tau$ to be
zero)

$$c^*_{ij}(t) = u^*_i(t)\, v_{ij}(t)$$


In the α-system, which is to be considered initially, the reinforcement
rule takes the form

$$\Delta v_{ij}(t) = u^*_i(t) \cdot \begin{cases} \eta & \text{if } \alpha_i(t) \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

In an elementary perceptron, where the only variable connections occur
from A-units to the R-unit, the simplified notation $v_i$ will generally
be taken to mean the value of the connection from unit $a_i$ to the R-unit.
The basic parameters with which we shall be concerned in this chapter are
the number of S-units, $N_s$, and the number of A-units, $N_a$.
Without loss of generality, we can assume the $N_s$ sensory units to be
situated at points in a two-dimensional field, or "retina", and regard the
input stimuli as patterns of illumination on the retina. A typical system
of this type is illustrated in Figure 5.

Figure 5 NETWORK ORGANIZATION OF A TYPICAL ELEMENTARY PERCEPTRON
(diagram labels: retina of S-units, A-units)


5.2 The Existence of Universal Perceptrons 


Most of the theoretical results obtained to date for elementary 
perceptrons are concerned with experiments in which a classification of an 


environment, C(W), is taught to the perceptron by some training procedure.
The first theorems to be considered deal with the question of whether
a solution to such a classification problem exists, or might exist, for a 
given perceptron. To begin with, the following theorem shows that the 
organization of an elementary perceptron is sufficient to permit the 
construction of a "universal system", for which a solution exists for every 
possible classification, C(W) . Perceptrons constructed in this manner 

are generally not very interesting as brain models, but the theorem indicates 
the wide range of possible behavior which might be obtained from such 


systems.


THEOREM 1: Given a retina with two-state (on or off) input signals,
the class of elementary perceptrons for which a
solution exists to every classification, C(W), of
possible environments W, is non-empty.


PROOF: Since it is sufficient to show the existence of such a perceptron,
we proceed by construction. Let there be one A-unit for every possible
stimulus configuration on the retina. Consider stimulus $S_i$ and its
corresponding A-unit, $a_i$. Let $a_i$ have an excitatory connection
(value equal to +1) originating from every "on" point in $S_i$, and an
inhibitory connection from every "off" point in $S_i$, and let its threshold
be equal to the number of excitatory connections. Then there will be one
and only one A-unit responding to every possible stimulus, and no
A-unit responds to more than one stimulus. (We say that $a_i$ "responds"
to $S_i$ if $a_i^* > 0$.) Now consider any stimulus world, W, defined on
the retina, and a corresponding classification, C(W), which associates
a positive or negative classification with each stimulus, $S_i$, in W.
In order to realize the classification, it is only necessary to set the
value of the connection from $a_i$ equal to +1 if the class of $S_i$ is positive,
or -1 if the class of $S_i$ is negative. Q.E.D.
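
The construction in the proof can be checked mechanically on a small retina. The following is a minimal sketch, assuming a three-point binary retina, inhibitory connections of value -1, and a parity classification chosen purely for illustration; the function names are hypothetical and do not come from the text.

```python
from itertools import product

def build_universal_a_units(n_points):
    """One A-unit per retinal configuration: weight +1 from each 'on'
    point, -1 from each 'off' point, threshold = number of 'on' points."""
    units = []
    for pattern in product((0, 1), repeat=n_points):
        weights = [1 if bit else -1 for bit in pattern]
        threshold = sum(pattern)
        units.append((pattern, weights, threshold))
    return units

def a_unit_outputs(units, stimulus):
    """Return the 0/1 activity of every A-unit for a binary stimulus."""
    outputs = []
    for _, weights, threshold in units:
        total = sum(w * s for w, s in zip(weights, stimulus))
        outputs.append(1 if total >= threshold else 0)
    return outputs

def classify(units, values, stimulus):
    """Sign of the R-unit input (sum of the values of active A-units)."""
    acts = a_unit_outputs(units, stimulus)
    u = sum(v * a for v, a in zip(values, acts))
    return 1 if u > 0 else -1

N_POINTS = 3  # assumed retina size, small enough to enumerate
units = build_universal_a_units(N_POINTS)
stimuli = [pattern for pattern, _, _ in units]

# Illustrative classification (an assumption): parity of the 'on' points.
target = {s: (1 if sum(s) % 2 else -1) for s in stimuli}
# A-R connection values set to +1 or -1 as in the proof.
values = [target[pattern] for pattern, _, _ in units]

all_correct = all(classify(units, values, s) == target[s] for s in stimuli)
one_unit_each = all(sum(a_unit_outputs(units, s)) == 1 for s in stimuli)
print(all_correct, one_unit_each)
```

Any other assignment of classes to the eight stimuli can be substituted for `target`; the number of A-units grows as $2^{N_s}$, which is why the construction is of theoretical interest only.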


While this solution is clearly uneconomical and of little practical 
interest, it is sufficient to show that there are no "special cases" of 
classifications which have no solution, at least for a retina of binary elements. 
If the inputs to the S-units are capable of taking on more than two values, 
then a more elaborate construction (e.g., one which separates each combination 
of input values to a different set of A-units) would be required. It is left to 
the reader to satisfy himself that a system with less ''depth" than an elementary 
perceptron (i.e., one in which S-units are connected directly to the R-unit, 
with no intervening A-units) is incapable of representing a solution to every 


C(W) , no matter how the values of the connections are distributed. 
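
One way to carry out this exercise is with a two-point retina (the particular retina and classification below are an illustrative choice, not taken from the text). Let the two S-units be joined directly to the R-unit by connections of values $w_1$ and $w_2$, with the R-unit threshold at zero, and require a positive response to each point alone and a negative response to both together:

$$
\begin{aligned}
S_1 = (1, 0) &: \quad w_1 > 0 \\
S_2 = (0, 1) &: \quad w_2 > 0 \\
S_3 = (1, 1) &: \quad w_1 + w_2 < 0
\end{aligned}
$$

Adding the first two inequalities gives $w_1 + w_2 > 0$, which contradicts the third, so no distribution of values realizes this C(W).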


5.3 The G-matrix of an Elementary α-Perceptron


In practice, the cases of interest are those in which each 
stimulus activates some set of A-units, and each A-unit is likely to 
respond to a great many different stimuli in W . In order to deal with 
such systems, the concept of a G-matrix has been found to be particularly 
helpful, and this will now be defined. The definition given here is suffi- 
cient for elementary perceptrons, and will be generalized in a later 


chapter to permit us to deal with more complex systems.

DEFINITION: Consider a (simple) perceptron, and a stimulus world, W,
consisting of n stimuli. Then the matrix

$$G = \begin{pmatrix} g_{11} & g_{12} & \cdots & g_{1n} \\ g_{21} & g_{22} & \cdots & g_{2n} \\ \vdots & \vdots & & \vdots \\ g_{n1} & g_{n2} & \cdots & g_{nn} \end{pmatrix}$$

consists of elements $g_{ij}$ called generalization coefficients. Each
element $g_{ij}$ is equal to the total change in value ($\sum_k \Delta v_k$) over
all A-units in the set responding to $S_i$ if the set of units responding to
$S_j$ are each reinforced with $\eta$ equal to $1/N_a$ (where $N_a$ is equal to
the number of A-units in the system). For simple perceptrons and a
given environment, G is fixed for all time.


If we are dealing with a particular α-perceptron, where
$\Delta v_i = a_i^*(t)\,\eta$, we have

$$g_{ij} = q_{ij}$$

where $q_{ij}$ = the proportion of A-units which respond both to $S_i$
and $S_j$.

If we are dealing with a randomly selected member of a class of perceptrons,
$q_{ij}$ is a random variable, and we have the equation for the expected
value of $g_{ij}$

$$E\,g_{ij} = Q_{ij}$$

where $Q_{ij}$ = the probability that an A-unit in a given class of
perceptrons responds to both stimuli, $S_i$ and $S_j$.


With $\eta = 1/N_a$ we have a "normalized G-matrix". For some purposes
it is convenient to take $\eta = 1$, in which case the "unnormalized G-matrix"
is equal to $N_a$ times the normalized matrix defined above.

For the α-system, $g_{ij}$ is simply a measure of the intersection
of the sets of A-units responding to $S_i$ and to $S_j$, and is
equivalent to a "set intersection matrix". G is always symmetric for
an alpha system. In any elementary perceptron (at a given time t)
the net input signal to the R-unit from the set of A-units responding to
stimulus $S_i$ will be called $u_i$ and is given by

$$u_i = u(S_i) = g_{i1} x_1 + g_{i2} x_2 + \cdots + g_{in} x_n \qquad (5.1)$$

where $x_j$ is the amount of reinforcement applied to the system, over all
appearances of $S_j$ prior to time t.* In matrix form, the vector u
of signals $u_i$ from all stimuli $S_i$ in W is given by

$$u = Gx \qquad (5.2)$$

where x is a vector of elements $x_j$, defined as above.

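
As a check on Equations (5.1) and (5.2), the following sketch builds a normalized G-matrix from a small activity table and compares the signals obtained from it with the signals obtained by summing the A-unit values directly. The activity table, the reinforcement totals, and the function names are hypothetical; all initial values are taken to be zero.

```python
def g_matrix(A, normalized=True):
    """Set-intersection counts between stimuli, divided by N_a when
    normalized. A[i][k] is 1 if A-unit k responds to stimulus i."""
    n, n_a = len(A), len(A[0])
    scale = 1.0 / n_a if normalized else 1.0
    return [[scale * sum(A[i][k] * A[j][k] for k in range(n_a))
             for j in range(n)] for i in range(n)]

def reinforce(A, x, eta):
    """Alpha-system memory after x[j] units of reinforcement for S_j,
    starting from zero: each A-unit active for S_j gains eta * x[j]."""
    n_a = len(A[0])
    return [eta * sum(A[j][k] * x[j] for j in range(len(A)))
            for k in range(n_a)]

# Hypothetical activity table: rows are stimuli, columns are A-units.
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
x = [2, -1, 3]           # signed reinforcement totals per stimulus
eta = 1.0 / len(A[0])    # normalized case, eta = 1/N_a

G = g_matrix(A)
v = reinforce(A, x, eta)

# Signal for S_i computed from the A-unit values ...
u_direct = [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]
# ... and from the G-matrix, as in Equation (5.2).
u_from_G = [sum(G[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

print(G)
print(u_direct, u_from_G)
```

The two signal vectors agree because the values used here are exact binary fractions; with other values of $N_a$ a tolerance-based comparison would be more appropriate.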
5.4 Conditions for the Existence of Solutions 


In general, if we are given the rules of organization of a 
perceptron and some classification, C(W), it is by no means easy to 
say whether or not a solution to C(W) exists for the perceptron in question. 
The following theorems deal with the existence of such solutions from 
several different points of view. We first define the bias ratio of an A-unit 


as follows: 


DEFINITION: Given a classification, C(W), the bias ratio of an A-unit,
$a_i$, is defined for any set of stimuli in W as $n_i^+ / n_i^-$, where $n_i^+$ =
number of stimuli in the set which are members of the positive class $C^+$ and
which activate $a_i$; $n_i^-$ = number of stimuli in the set which are members
of the negative class $C^-$ and which activate $a_i$.


* It is assumed here that all initial $v_i = 0$.

THEOREM 2: Given an elementary perceptron and a classification
C(W), the following conditions are necessary
but not sufficient for a solution to C(W) to exist:

i) Every stimulus must activate at least one A-unit;

ii) There should be no subset of stimuli containing at
least one member of each class, such that in the
union of the responding A-unit sets, every A-unit
has the same bias ratio (with respect to the stimuli
of the subset).


PROOF: We first prove that the conditions are necessary. Condition i) 


is obvious. The proof that condition ii) is necessary is as follows: 


Assume there is a subset violating this condition. Let $u_i$ =
input signal to the R-unit generated by stimulus $S_i$, and let $v_j$ be the
value of A-unit $a_j$. Then summing the values of
all such signals from stimuli of the positive class in this subset, we have
(since violation of ii) requires that $n_j^+/n_j^-$ is constant for A-units
responding to stimuli in this subset)

$$\sum_{S_i \in C^+} u_i = \sum_j n_j^+ v_j = \frac{n^+}{n^-} \sum_j n_j^- v_j = \frac{n^+}{n^-} \sum_{S_i \in C^-} u_i$$

Thus the sum of the R-unit input signals for stimuli of the positive
class must have the same sign as the sum of the R-unit input signals
for stimuli of the second class. But then one of the sums must disagree
in sign with the sign of the class, and therefore, one of its components
(i.e., one of the $u_i$) must disagree in sign with the class, indicating
that at least one stimulus must be classified incorrectly.

To show that these conditions are not generally sufficient,
consider the following example: Let there be five stimuli, and four A-units.
The A-units activated by each stimulus are:

$S_1$ activates $a_1$

$S_2$ activates $a_2$

$S_3$ activates $a_3$ and $a_4$

$S_4$ activates $a_1$, $a_2$, and $a_3$

$S_5$ activates $a_1$, $a_2$, and $a_4$

Let the positive class consist of $S_1$, $S_2$, and $S_3$, and the negative
class consist of $S_4$ and $S_5$. Then the bias ratios for $a_1$ and $a_2$ are
not the same as for $a_3$ and $a_4$. Also, there exists no subset with
stimuli from each class, with equal bias ratios for all A-units. The
values of $a_1$ and $a_2$ must be positive, and the sum of the values of $a_3$
and $a_4$ must also be positive, to obtain the required classification
for the members of the first class. But then it is clear that either
$S_4$ or $S_5$ must be classified incorrectly, which proves that conditions i)
and ii) are not sufficient.
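
The counter-example can be verified numerically. The sketch below assumes the activity pattern $S_4 \to \{a_1, a_2, a_3\}$, $S_5 \to \{a_1, a_2, a_4\}$, and uses a certificate vector `y` that is our own choice rather than part of the text: if non-negative multipliers of the sign-corrected rows sum to zero, then no assignment of values can make every sign-corrected signal positive.

```python
# Activity table of the example: rows S1..S5, columns a1..a4.
A = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 1, 1, 0],
     [1, 1, 0, 1]]
rho = [1, 1, 1, -1, -1]   # S1-S3 positive class, S4-S5 negative class

def bias_counts(A, rho, unit, subset):
    """Return (n+, n-) for one A-unit over a subset of stimulus indices."""
    n_pos = sum(1 for i in subset if A[i][unit] and rho[i] > 0)
    n_neg = sum(1 for i in subset if A[i][unit] and rho[i] < 0)
    return n_pos, n_neg

all_stimuli = range(len(A))
counts = [bias_counts(A, rho, k, all_stimuli) for k in range(4)]

# Sign-corrected rows: a solution v needs sum_k B[i][k] * v[k] > 0 for all i.
B = [[rho[i] * a for a in row] for i, row in enumerate(A)]

# Hypothetical certificate: non-negative multipliers y with y'B = 0.
# Then y'(Bv) = 0 for every v, so Bv cannot be strictly positive.
y = [2, 2, 1, 1, 1]
combo = [sum(y[i] * B[i][k] for i in range(len(B))) for k in range(4)]

print(counts)   # (n+, n-) for a1..a4 over the whole stimulus world
print(combo)    # a zero vector certifies that no solution exists
```

The multipliers mirror the verbal argument: twice the constraints for $S_1$ and $S_2$ plus the constraint for $S_3$ exactly cancel the constraints for $S_4$ and $S_5$.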


In the next theorem we make use of the symbol u to denote
a signal vector, such that the element $u_i$ agrees in sign with the
classification of $S_i$ in C(W). Such a signal vector will evoke the
correct response for each stimulus in W. Two such vectors which
agree in the signs of their elements are said to be in the same orthant
(generalized quadrant, in n dimensions).


In Theorem 9, a necessary and sufficient condition, closely related
to the above, will be presented.

THEOREM 3: Given an elementary α-perceptron, a stimulus world W,
and any classification C(W); then in order for a solution
to C(W) to exist, it is necessary and sufficient that there
exist some vector u in the same orthant as C(W), and
some vector x such that Gx = u.


PROOF: The proof would follow trivially from Equation (5.2) and the
definition of u, were it not for the possibility that a solution might
exist involving some unique assignment of values to the A-R connections,
which could not be attained by any reinforcement vector, x, defined as in
Equation (5.1). It will be shown, therefore, that if a solution exists, in the
form of any assignment of values to A-R connections, an equivalent solution
must exist corresponding to the reinforcement of each stimulus, $S_i$, by an
amount $x_i$. For brevity, throughout the following discussion, we will speak
of "the value of an A-unit" in place of "the value of the connection from an
A-unit to the R-unit". The following definitions and notation will be used:

$$a_i^*(S_j) = \begin{cases} 1 & \text{if the A-unit } a_i \text{ responds to } S_j \\ 0 & \text{otherwise} \end{cases}$$

A is an n by $N_a$ matrix, in which the element $a_{ij} = a_j^*(S_i)$.
A solution to a classification problem is said to exist if there is some
distribution of values over the A-units which enables the perceptron to
perform the discrimination; i.e., there exist vectors v and u such
that

$$Av = u$$

Consider the matrix AA'. The i,j element of this matrix (say $\lambda_{ij}$) is

$$\lambda_{ij} = \sum_k a_k^*(S_i)\, a_k^*(S_j)$$

But the (un-normalized) G-matrix for an α-system, expressed in
terms of the above functions, has elements

$$g_{ij} = \sum_k a_k^*(S_i)\, a_k^*(S_j)$$

so that the matrix G = AA'. Note that this shows that G is either
positive definite or positive semidefinite.

We then have, for any vector x:

1) $x'A = 0 \Rightarrow x'AA' = x'G = 0$

2) $x'G = 0 \Rightarrow x'Gx = x'AA'x = (x'A, x'A) = 0 \Rightarrow x'A = 0$

Hence, the rank of G = rank of A, since any vector x which is in
the left null space of G is also in the left null space of A, and
conversely; therefore the left null spaces of G and A are identical.
Since the rank plus the dimension of the null space is equal to the
dimension of the domain, G and A must be of the same rank.

But the columns of G are linear combinations of the columns of
A, hence the space spanned by the columns of G is identical with the
space spanned by the columns of A.

Since Av is a linear combination of the columns of A, the
existence of v and u such that Av = u implies the existence of a vector
x such that Gx = u. Thus, if a solution exists, there is a solution to
the equation Gx = u, so that the condition of the theorem is necessary.
But it is also sufficient, since u by definition represents a solution
vector. Q.E.D.
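
The rank argument can be illustrated on a small case. This is a sketch only: the activity table is hypothetical, and the `rank` helper is a plain Gaussian elimination over exact rationals (using the standard-library `fractions` module) written for this illustration.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        # Find a row at or below position r with a non-zero entry here.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Hypothetical activity table: n = 4 stimuli (rows), N_a = 3 A-units.
A = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 1, 1]]

# Un-normalized G-matrix, G = AA'.
G = [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]

symmetric = all(G[i][j] == G[j][i] for i in range(4) for j in range(4))
print(rank(A), rank(G), symmetric)
```

Here A has four rows but only three columns, so its rank cannot exceed three, and the 4 by 4 matrix G inherits the same rank and is therefore singular.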


COROLLARY 1: Given an elementary perceptron and a stimulus world W,
then if G is singular, some C(W) exists for which
there is no solution.


PROOF: Each C(W) requires a solution vector in a different orthant, and
the set of all C(W), for a given W, requires solutions in every possible
orthant. But if G is singular, it maps the entire space into a hyperplane,
and this plane must fail to intersect certain orthants. Consequently, the
classifications C(W) which are represented by vectors in these orthants
have no solution.


COROLLARY 2: Given an elementary perceptron, if the number of stimuli
in W is $n > N_a$, there is some C(W) for which no
solution exists.

PROOF: From Theorem 3 and Corollary 1, it is clear that there will
be some C(W) which has no solution if and only if G is singular. G
has the same rank as the matrix A; but A is an n by $N_a$ matrix,
implying that A, and therefore G, has rank $\le N_a < n$.

COROLLARY 3: For any elementary perceptron, as the number n of
stimuli in W increases, the probability that a randomly
selected classification, C(W), has a solution approaches
zero (where C(W) is chosen from a uniform distribution
over the possible classifications of W).


PROOF: From Corollary 2, as n increases beyond the number of A-units
in the perceptron, there must be some C(W) without a solution. At the same
time, increasing n increases the set of possible classifications in proportion
to $2^n$. But, owing to a theorem by R. D. Joseph and Louise Hay (Ref. 41,
Appendix ), the number n(r) of classifications which have solutions is no
greater than $2\left[\binom{n-1}{0} + \binom{n-1}{1} + \cdots + \binom{n-1}{r-1}\right]$, where $r \le N_a$ is the rank of the
G-matrix. Therefore, the upper bound of the probability of selecting at random
one of the classifications which has a solution diminishes with $n(r)/2^n$, which
goes to zero as n goes to infinity.
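
To give a feel for how quickly this bound falls, the sketch below evaluates it for a fixed rank and a growing number of stimuli. It assumes the bound has the form $2\sum_{k=0}^{r-1}\binom{n-1}{k}$; the rank value and the function names are illustrative choices.

```python
from math import comb

def solvable_upper_bound(n, r):
    """Upper bound on the number of classifications of n stimuli that
    have solutions when the G-matrix has rank r."""
    return 2 * sum(comb(n - 1, k) for k in range(r))

def solvable_fraction_bound(n, r):
    """Upper bound on the probability that a uniformly chosen
    classification has a solution (capped at 1)."""
    return min(1.0, solvable_upper_bound(n, r) / 2 ** n)

r = 5  # hypothetical rank, e.g. a perceptron with five A-units
fractions_by_n = {n: solvable_fraction_bound(n, r) for n in (5, 10, 20, 40, 80)}
for n, bound in fractions_by_n.items():
    print(n, bound)
```

With r = 5 the bound is 1 at n = 5 (every classification may be solvable), drops to one half at n = 10, and continues to shrink as n grows.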


Several additional tests for the existence of solutions, which are 
of practical utility in diagnosing small systems, will be found in Theorems 9 


and 10, at the end of this chapter. 


5.5 The Principal Convergence Theorem 


In the preceding section, the existence of solutions to classification
problems in an elementary perceptron was considered, but nothing has been
said about the ability to achieve such a solution by a training procedure. In
this section, we consider the ability of an elementary α-perceptron to learn
the solution to a classification C(W) under an error correction procedure.
The following theorem is fundamental to the theory of perceptrons.

A general definition of an error correction procedure was given
in Definition 41, in Chapter 4. We now define in detail two specific forms of
this procedure, as they apply to the elementary α-perceptron.


Consider some classification, C(W). Let

$$\rho_i = \begin{cases} +1 & \text{if stimulus } S_i \text{ is to be in the positive class} \\ -1 & \text{if stimulus } S_i \text{ is to be in the negative class} \end{cases}$$

where $i = 1, \ldots, n$.


In order to obtain the most general conditions for the following theorem, a
non-quantized error correction procedure is defined as follows: No response
will be considered correct unless the magnitude of the input signal to the
R-unit ($u_i$) is greater than $\delta$, and the sign of $u_i$ agrees with $\rho_i$
for the current stimulus. (This corresponds to an R-unit with a threshold
of $\delta$, or for the special case where $\delta = 0$, it corresponds to a simple
R-unit.) If no error occurs for stimulus $S_i$ (i.e., $\rho_i u_i > \delta$) no
reinforcement occurs; but if an error does occur a quantity $\eta = \rho_i \Delta x_i$
is added to the value of each active A-unit, $\Delta x_i$ (the number of units of
reinforcement) being just sufficient to bring the magnitude of the signal $u_i$
past the threshold level, $\delta$, to the level $\epsilon > \delta$. In a quantized
correction procedure, the identical rules apply, except that $\eta = \rho_i \Delta x_i = \pm 1$,
$\Delta x_i$ representing a single unit of reinforcement.

THEOREM 4: Given an elementary α-perceptron, a stimulus
world W, and any classification C(W) for which a
solution exists; let all stimuli in W occur in any
sequence, provided that each stimulus must reoccur
in finite time; then beginning from an arbitrary initial
state, an error correction procedure (quantized or
non-quantized) will always yield a solution to C(W) in
finite time, with all signals to the R-unit having
magnitudes at least equal to an arbitrary quantity $\delta \ge 0$.


PROOF: The matrix A is defined as in Theorem 3, so that $a_{ij} = a_j^*(S_i)$.
We recall that AA' = G. We also define the matrix B such that
$b_{ij} = \rho_i\, a_j^*(S_i)$; the matrix H = BB'; and the diagonal matrix D
such that $d_{ij} = \delta_{ij}\,\rho_i$. Note that DD = I, DA = B, and H = DGD.


We first consider the non-quantized error correction procedure.
In this case, no reinforcement is applied unless an error occurs; if an error
does occur (when $\rho_i u_i \le \delta$) the quantity $\rho_i \Delta x_i$ ($\Delta x_i > 0$) is added
to the value of each active A-unit, $\Delta x_i$ being chosen so that the input to
the response unit is exactly $\rho_i \epsilon$ ($\epsilon > \delta$). It will be shown below that
such a $\Delta x_i$ exists.
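
As a brief sketch of why such a step size exists (writing $\delta$ for the threshold level and $\epsilon$ for the target level, and assuming the un-normalized G-matrix, so that $g_{ii}$ is the number of A-units active for $S_i$): adding $\rho_i \Delta x_i$ to each active A-unit changes $u_i$ by $g_{ii}\,\rho_i\,\Delta x_i$, so requiring the corrected signal to equal $\rho_i \epsilon$ gives

$$\rho_i \left( u_i + g_{ii}\,\rho_i\,\Delta x_i \right) = \epsilon \quad\Longrightarrow\quad \Delta x_i = \frac{\epsilon - \rho_i u_i}{g_{ii}}$$

Since an error means $\rho_i u_i \le \delta < \epsilon$, and $g_{ii} > 0$ whenever $S_i$ activates at least one A-unit, $\Delta x_i$ is strictly positive. For instance, with $g_{ii} = 4$, $\rho_i u_i = -1$, and $\epsilon = 1$, the correction is $\Delta x_i = 0.5$.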


The proof of this theorem (which was first published by Rosenblatt in 
Ref. 86) has undergone a number of modifications. The original treat- 
ment was insufficient to prove the theorem in a rigorous fashion; 
subsequent forms have been due to Block, Joseph, Kesten, and others; 
and the present proof owes much to each of these. An interesting 
alternative approach, with a slightly modified reinforcement procedure, 
has recently been proposed by Papert (Ref. 67) who attempts to shorten 
the demonstration and avoids use of the G-matrix. Unfortunately, there 
are several logical errors in Papert's argument, the correction of which 
would tend to lengthen his demonstration.

It has been noted previously that the space spanned by the columns
of G is the same as the space spanned by the columns of A (the rank of
G being equal to the rank of A). Consequently, for any $N_a$-vector V,
there is an n-vector Z such that AV = GZ.


An arbitrary initial state for the perceptron is represented by an
$N_a$-vector $V^0$ of values for the A-units. Let $Z^0$ be a corresponding
n-vector. Let Z be the n-vector whose i-th component, $z_i$, is
equal to the total quantity of reinforcement given in all previous corrections
for stimulus $S_i$, i.e.,

$$z_i = \rho_i \sum \Delta x_i \quad \text{(summing over all previous corrections).}$$

Let $U = GZ^0 + GZ = G(Z^0 + Z) = GD(X^0 + X)$, where $X^0 = DZ^0$ and
$X = DZ$. The i-th component of U, $u_i$, would be the input to the
R-unit if $S_i$ were to occur at the present time. Let W = DU. This
equation can be written

$$W = H(X^0 + X)$$

where a negative $w_i$ (or more precisely, $w_i \le \delta$) represents an error.
The $x_i$ are always non-negative, and this will be understood for the
remainder of the proof. We now define M as the maximum diagonal element,
$h_{ii}$, of H. We also define the function of the n-vector Z

$$K(Z) = Z'HZ - 2\epsilon \sum_{i=1}^{n} z_i$$

We then obtain the following results:


1) The existence of a solution means that there is an $N_a$-vector $V^*$ such
that for all i

$$\sum_j b_{ij}\, v_j^* = w_i^*$$

where $w_i^* > 0$. In matrix form, $BV^* = W^*$.


2) Consider $X'HX$ for all X such that $\|X\| = 1$ (and of course $x_i \ge 0$).
$X'HX = (X'B)(X'B)'$, so that $X'HX \ge 0$. Suppose $X'HX = 0$; then $X'B = 0$.
Clearly $X'W^* > 0$, but $X'W^* = X'BV^* = 0$. This contradiction shows
that $X'HX > 0$ on this closed, bounded set, so that there exists a minimum
$\alpha > 0$ such that $X'HX \ge \alpha \|X\|^2$ for all X for which $x_i \ge 0$ for all i.
Note that $M \ge \alpha > 0$ as a consequence. Note also that $g_{ii} = h_{ii} \ge \alpha > 0$.


3) $\sum_i x_i \le \sqrt{n}\,\|X\|$ (Schwarz's inequality)
and $|X'HX^0| \le \|HX^0\| \cdot \|X\| = h\,\|X\|$ (Schwarz's inequality), where $h = \|HX^0\|$.


4) $$K(X^0 + X) - K(X^0) = K(X) + 2X'HX^0 \ge \alpha\|X\|^2 - 2\epsilon\sqrt{n}\,\|X\| - 2h\|X\| \ge -\frac{(h + \epsilon\sqrt{n})^2}{\alpha}$$


5) $\dfrac{\partial K(X^0 + X)}{\partial x_i} = 2(w_i - \epsilon)$ and $\dfrac{\partial w_i}{\partial x_i} = h_{ii} > 0$. This latter relation proves the contention at
the beginning of the proof that $\Delta x_i \ge 0$ exists. Specifically, we have

$$\Delta x_i = \frac{\epsilon - w_{i0}}{h_{ii}}$$


6) A correction is made for $S_i$ only if $w_i \le \delta$. Denote the change in K
when this is done by $\Delta K$, and by subscript 0 the conditions before the
correction.

$$\Delta K(X^0 + X_0) = 2\int_{x_{i0}}^{x_{i0} + \Delta x_i} (w_i - \epsilon)\, dx_i = 2\int_{w_{i0}}^{\epsilon} \frac{1}{h_{ii}} (w_i - \epsilon)\, dw_i = -\frac{(w_{i0} - \epsilon)^2}{h_{ii}} \le -\frac{(\epsilon - \delta)^2}{M}$$


7) From 4) and 6) we conclude that the maximum number of corrections
is

$$N \le \frac{M (h + \epsilon\sqrt{n})^2}{\alpha\,(\epsilon - \delta)^2}$$

8) In particular, if $X^0 = 0$ and $\delta = 0$ (corresponding to a perceptron with
a simple R-unit and no initial reinforcement) then $h = \|HX^0\| = 0$ and
the bound becomes $nM/\alpha$.


This proves the theorem for the case of the non-quantized
correction procedure, since N is finite, implying that the process arrives
at a solution in finite time. For the quantized case, we have the condition
that $\Delta x_i$ is always 1 when a correction occurs (the vector X representing
the numbers of unit corrections for each of the n stimuli). For convenience,
we take the case where $\delta = 0$ and $\epsilon = M = (g_{ii})_{\max}$. Then in step 6)
we have:

6a) $$\begin{aligned} \Delta K(X^0 + X_0) &= 2\int_{x_{i0}}^{x_{i0}+1} (w_i - M)\, dx_i = 2\int_{x_{i0}}^{x_{i0}+1} \left[ w_{i0} + h_{ii}(x_i - x_{i0}) - M \right] dx_i \\ &= 2\left[ (w_{i0} - M)\, x_i + \tfrac{1}{2} h_{ii} (x_i - x_{i0})^2 \right]_{x_{i0}}^{x_{i0}+1} \\ &= 2\left( w_{i0} - M + \tfrac{1}{2} h_{ii} \right) \\ &\le -M \end{aligned}$$

7a) From 4) and 6a) we have that the maximum number of corrections is*

$$N \le \frac{(h + M\sqrt{n})^2}{\alpha M}$$


An alternative bound, found by H. Kesten, is az max (- 2p; a; oe hii) - 
This under some circumstances represents a sharper bound; nonetheless,
both bounds are generally quite poor as estimates of the actual number
of steps.

8a) This upper bound is again minimized when $X^0 = 0$, so that $h = \|HX^0\| = 0$.
The bound is then $nM/\alpha$.


This completes the proof of the theorem for the quantized case.


Q.E.D.


COROLLARY: Given an elementary perceptron, a stimulus world W,
and any classification C(W); then if a solution to C(W)
exists, the set of possible solutions to C(W) has positive
measure over the phase space of the perceptron.


PROOF: From the proof of the theorem, we know that if a solution exists,
there is a strictly positive vector X such that HX = P (where P is a
strictly positive vector). Let Y be any n-vector; then $\|HY\| \le b\,\|Y\|$,
where b is the absolute value of the maximum eigenvalue of H, or the
norm of H. Let $\mu = \min_i p_i > 0$, and let $\epsilon = \mu/(b+1)$. Let U
be in the $\epsilon$-sphere around X, i.e., U = X + Y where $\|Y\| < \epsilon$. Let
Z = HY, and let $\xi = \max_i |z_i| \le \|Z\| = \|HY\| \le b\epsilon < \mu$. Then

$$p_i + z_i \ge \mu - \xi > 0$$

$$HU = H(X + Y) = P + Z$$

Therefore, HU is strictly positive, and U is an alternative solution.
This means that there is a cone of vectors including X which maps
into the region which contains P, any such vector representing an
equivalent solution. Since the volume of this cone has positive measure over the
phase space, the corollary follows.

5.6 Additional Convergence Theorems


The theorem in the previous section deals with convergence to a
solution state in an α-perceptron, trained by the error correction procedure.
In this section, it will be shown, first, that a weaker form of correction
procedure can also be guaranteed to yield a solution; secondly, that
reinforcement procedures in which the magnitude of $\eta$ does not depend on
whether or not the current response is correct cannot, in general, be relied
on to converge to a solution. If a solution state does occur in such a system,
it will be shown that it is apt to be unstable except under special conditions.


DEFINITION: A random-sign correction procedure is one in which some
quantity of reinforcement is applied to the perceptron when an error occurs,
and zero reinforcement is applied when the response is correct. The sign
of $\eta$ is chosen at random, with an equal probability of being positive or
negative, regardless of the response of the perceptron.


THEOREM 5: Given an elementary α-perceptron, with a finite
number of memory states, a random-sequence stimulus
world W, and any classification C(W) for which a
solution can be reached from the starting point by some
reinforcement sequence, then a solution will be obtained
in finite time with probability 1 by means of a
random-sign correction procedure.

PROOF: The random-sign correction procedure consists of a random
walk in which each step corresponds either to a step of the required
correction process, or a step in the reverse direction. In the course of
this process, the vector U (defined in connection with Theorem 4) will
eventually reach some attainable trapping state with probability 1. But the
only trapping states are in the solution space. Consequently, a solution
will be obtained in finite time.


In Chapter 4 (Definition 40), an S-controlled reinforcement
system was defined as a training procedure in which the magnitude of $\eta$ is
constant, regardless of the current response of the system, the sign of $\eta$
being chosen to agree with the sign of the classification of the current stimulus,
$S_i$, in C(W). Unlike the methods considered previously in this chapter,
this is not a correction procedure; i.e., the magnitude of reinforcement does
not depend on the occurrence of an error, and only the sign of the required
response is taken into consideration in determining what reinforcement
should be applied. In the following analysis, a solution will be called stable
if, in a given experimental system, all future memory states will also
satisfy the conditions of a solution, no matter how long the experiment
continues. A system employing a correction procedure, since it receives
no further reinforcement once a solution state is achieved, is inherently
stable. The following theorem shows that this is not the case for an
S-controlled system.


THEOREM 6: Given an elementary © -perceptron, a stimulus world W , 
and some classification C(W) for which a solution exists, 
a solution can sometimes be achieved by an S -controlled 
reinforcement procedure. However, such a solution cannot 
be guaranteed for an arbitrary stimulus sequente; and may be 
unstable if it occurs. 

PROOF: We will first consider a case in which a stable solution does occur,
for the type of experimental system specified by the theorem. Let W consist
of two stimuli, $S_1$ and $S_2$. Let $S_1$ activate some set of A-units, $A_1$,
and let $S_2$ activate a disjoint set of A-units, $A_2$. Let C(W) assign $S_1$
to the positive class and $S_2$ to the negative class. Regardless of the
sequence and relative frequency of $S_1$ and $S_2$, it is clear that each
occurrence of $S_1$ will augment $u_1$ in a positive direction, while each
occurrence of $S_2$ will make $u_2$ increasingly negative. Since the intersection
$A_1 \cap A_2$ is assumed to have zero measure, there will be no interference between
the two stimuli, so that the acquired solution will remain stable no matter how
long the process continues. This example proves the first part of the theorem.

Let us now consider the case of intersecting A-unit sets. Suppose $S_1$ activates
two units, $a_1$ and $a_2$, while $S_2$ activates units $a_2$ and $a_3$ (the unit $a_2$
responding to both stimuli). If the frequencies of $S_1$ and $S_2$ are equal, their
effect on $a_2$ will tend to cancel, and a solution with $v_1$ positive, $v_3$ negative,
and $v_2$ equal to zero will tend to occur. As the sequence continues, the
magnitudes of $v_1$ and $v_3$ will tend to increase without bound, so that the solution
will become increasingly stable as time goes on. Suppose, on the other hand,
that $S_1$ occurs with ten times the frequency of $S_2$. In this case, $a_2$ will
gain ten units of positive value for every unit of negative value received from
$S_2$, so that $v_2$ will tend to increase in a positive direction at nine times
the rate that $v_3$ progresses in a negative direction. Thus the net signal, $u_2$,
transmitted to the R-unit in response to $S_2$, which is equal to $v_2 + v_3$,
will clearly become strongly positive as time goes on, resulting in an
erroneous classification of $S_2$. Even if the initial state of the perceptron
was a solution state (e.g., $v_1 = +1$, $v_3 = -1$, $v_2 = 0$) it is clear that
the S-controlled procedure will quickly destroy the existing solution, which
is therefore unstable. Q.E.D.
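The second half of the proof can be checked numerically. The sketch below is only an illustration: it assumes an alpha-system update in which every active A-unit gains a fixed increment `eta` carrying the sign of the stimulus class, regardless of the response. The function names and the particular stimulus sequences are hypothetical choices, not part of the original text.

```python
def s_controlled(activity, classes, sequence, values, eta=1.0):
    """S-controlled reinforcement (assumed alpha-system rule): each stimulus
    adds eta * class sign to every A-unit it activates, whatever the response."""
    v = list(values)
    for s in sequence:
        for k, active in enumerate(activity[s]):
            if active:
                v[k] += eta * classes[s]
    return v


def signals(activity, v):
    """Net signal u_j reaching the R-unit for each stimulus S_j."""
    return [sum(a * w for a, w in zip(row, v)) for row in activity]


# S_1 activates a_1 and a_2; S_2 activates a_2 and a_3 (a_2 is shared).
activity = [[1, 1, 0],
            [0, 1, 1]]
classes = [+1, -1]              # S_1 positive, S_2 negative
start = [1.0, 0.0, -1.0]        # the solution state v_1 = +1, v_2 = 0, v_3 = -1

balanced = [0, 1] * 50                  # equal frequencies
biased = ([0] * 10 + [1]) * 10          # S_1 ten times as frequent as S_2

v_balanced = s_controlled(activity, classes, balanced, start)
v_biased = s_controlled(activity, classes, biased, start)

print(signals(activity, v_balanced))    # [51.0, -51.0]: solution preserved
print(signals(activity, v_biased))      # [191.0, 79.0]: u_2 > 0, S_2 misclassified
```

With equal frequencies the shared unit stays at zero and the solution becomes more stable; with the 10:1 bias the shared unit dominates the signal for $S_2$ and the initial solution is destroyed.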


H. D. Block has pointed out that, while a solution to C(W) cannot be guaranteed
with a random stimulus sequence, nonetheless if a solution exists then
there exists some S-sequence which will guarantee a solution with S-controlled
reinforcement. In particular, if $Gz = u$ is a solution, then the occurrence of
$S_i$ with frequency $f_i = |z_i|$ (for all $i$) will guarantee a solution.

In the example considered above, it is clear that a frequency bias,
in which the stimuli of one class are much more frequent than members of the
other class, can strongly prejudice the perceptron to always give the response
associated with the more frequent class, in an S-controlled system. Such a
problem would exist, for example, in trying to teach a perceptron to distinguish
the letters "E" and "X" occurring with their normal frequency in English text.
Even if all stimuli occur with equal frequency, however, a similar effect
exists if there is a size bias, in which the stimuli in one class activate
more S-points (or illuminate a larger area of the retina) than the other class.
As will be seen in the following chapter, larger stimuli generally tend to
activate more A-units than smaller stimuli, and in the limiting case, the set
of A-units responding to a smaller stimulus may be entirely contained within
the set responding to a larger stimulus. Suppose, for example, that $S_1$
activates units $a_1$ and $a_2$, while $S_2$ only activates $a_2$. A solution which
classifies $S_1$ positively and $S_2$ negatively clearly exists (e.g., let $v_1 = +5$
and $v_2 = -1$) but if the stimuli occur alternately, $v_1$ will tend to become
increasingly positive, while $v_2$ tends to oscillate about zero. The reader
can satisfy himself that (starting with 0 values) a quantized error correction
procedure yields a stable solution to this problem after five stimuli.
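The size-bias example can be traced by hand or with a few lines of code. The following is a minimal sketch under stated assumptions: unit increments, a zero signal counted as an error, and $S_2$ presented first. None of these conventions is fixed by the passage above, and the step count depends on them. The function names are illustrative.

```python
def error_correction(stimuli, classes, order, max_steps=100):
    """Quantized error correction: on an error (or a zero signal), every
    active A-unit is changed by one unit in the direction of the class."""
    v = [0, 0]
    last_correction = 0
    for step in range(1, max_steps + 1):
        s = order[(step - 1) % len(order)]
        u = sum(v[k] for k in stimuli[s])
        if u * classes[s] <= 0:          # wrong sign, or undecided
            for k in stimuli[s]:
                v[k] += classes[s]
            last_correction = step
    return v, last_correction


def s_controlled(stimuli, classes, order, n_steps):
    """S-controlled reinforcement: reinforce on every presentation."""
    v = [0, 0]
    for step in range(n_steps):
        s = order[step % len(order)]
        for k in stimuli[s]:
            v[k] += classes[s]
    return v


stimuli = [(0, 1), (1,)]     # S_1 activates a_1, a_2; S_2 activates only a_2
classes = [+1, -1]
order = [1, 0]               # alternate, starting with S_2 (assumption)

v_ec, last = error_correction(stimuli, classes, order)
v_sc = s_controlled(stimuli, classes, order, 20)

print(v_ec, last)   # [2, -1] 5: no further corrections after the fifth stimulus
print(v_sc)         # [10, 0]: v_1 grows, v_2 returns to zero, S_2 is never negative
```

Under these conventions the last correction falls on the fifth stimulus; starting with $S_1$ instead moves it to the sixth, so the exact figure should be read as convention-dependent.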


In the case of R-controlled reinforcement procedures (Definition 39
in Chapter 4) it makes no sense to talk about the probability of convergence to
a solution for an arbitrary classification, C(W), since the required
classification plays no part whatever in determining either the sign or the
magnitude of the reinforcement. As will be shown later, it may happen
that an R-controlled reinforcement system leads to the acquisition of an
interesting stable response function by a perceptron, but this cannot
generally be guaranteed, and any classification which is achieved is
necessarily one which is selected by the perceptron, rather than by the
experimenter. The interesting questions concerning such systems deal with the
types of classifications to which they converge, for different kinds of
environments. In particular, we will be interested in any systems which
tend to form classifications on the basis of some concept of stimulus
"similarity". It will be shown in later chapters that elementary perceptrons
do not, in general, tend to form classes on this basis except under special,
and highly restrictive, environmental conditions, but that cross-coupled
perceptrons appear to have a striking capability for such "spontaneous
organization".


In the preceding theorems, only perceptrons employing alpha
system reinforcement have been considered. The remaining two theorems
consider two departures from this model. The first demonstrates that an
even weaker form of reinforcement than that in the random-sign correction
procedure can guarantee a solution in finite time, provided it is employed in
a correction procedure, in which the application of reinforcement depends
upon the occurrence of response errors. We define a random perturbation
correction procedure as a reinforcement process in which, if an error occurs,
reinforcement is applied to the active A-units, as in the $\alpha$-system, except
that the magnitude and sign of $\eta$ are both chosen independently and
separately for each reinforced connection in the system, according to some
probability distribution.


THEOREM 7: Given an elementary perceptron with a finite number
of memory states, a stimulus world W, and a classification
C(W) for which a solution can be reached
from the starting point by some reinforcement sequence,
then a solution can always be obtained in finite time by
means of a random perturbation correction procedure.

PROOF: The reinforcement process is a random walk, which (for the
given conditions) will eventually take the representative point of the system
to every attainable point in phase space. Since the number of points is assumed
to be finite, a solution must be reached in finite time.
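A small simulation makes the random-walk argument concrete. This is a sketch under assumptions that are not in the text: memory values are integers clipped to a fixed range (to give a finite number of memory states), each perturbation is drawn uniformly from {-1, 0, +1}, and stimuli are presented cyclically. The names `random_perturbation`, `v_min`, and `v_max` are hypothetical.

```python
import random


def random_perturbation(activity, classes, v_min=-2, v_max=2,
                        seed=0, max_steps=100_000):
    """Random perturbation correction: on an error, each active A-unit
    receives an independent random change; nothing happens otherwise."""
    rng = random.Random(seed)
    n_units = len(activity[0])
    v = [0] * n_units
    clean_run = 0                       # consecutive correct responses
    for step in range(1, max_steps + 1):
        s = (step - 1) % len(activity)
        u = sum(a * w for a, w in zip(activity[s], v))
        if u * classes[s] > 0:
            clean_run += 1
            if clean_run == len(activity):   # every stimulus correct in a row
                return v, step
            continue
        clean_run = 0
        for k in range(n_units):
            if activity[s][k]:
                delta = rng.choice([-1, 0, 1])
                v[k] = max(v_min, min(v_max, v[k] + delta))
    return None, max_steps


# Overlapping A-unit sets: S_1 -> {a_1, a_2}, S_2 -> {a_2, a_3}.
activity = [[1, 1, 0],
            [0, 1, 1]]
classes = [+1, -1]

v_found, steps = random_perturbation(activity, classes)
print(v_found, steps)
```

Because reinforcement is applied only after an error, any solution state is absorbing: once the walk wanders into one, the values stop changing.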


Of the three reinforcement procedures which have been shown
to guarantee solutions in elementary perceptrons (error correction, random-sign
correction, and random perturbation correction procedures) the first
is clearly the strongest, and can be expected to converge most rapidly. The
random perturbation procedure will converge most slowly, since it must
hunt through a large domain of the phase space of the system before achieving
a satisfactory terminal state, and is not guided during this process by any
directional constraints. In this respect, it shares many of the difficulties
of Ashby's homeostat (Ref. 3); but it shares the virtue of the homeostat as
well, that if the solution space is attainable, it will ultimately arrive at a
solution no matter how complicated its functional representation may be.
The random-sign and random perturbation procedures may prove to be of
interest in biological models, since the only information required for the
control of reinforcement is whether or not an error has occurred.
In practice, it will be seen that a gamma system (Definition 38,
Chapter 4) generally works at least as well as, and sometimes better than, an
alpha system. Nonetheless, the following theorem indicates that this
system lacks the true universality of the alpha system.

THEOREM 8: Given an elementary $\gamma$-perceptron, a stimulus
world W, and a classification C(W), it is possible
that a solution to C(W) exists which cannot be
achieved by the perceptron.

PROOF: Let each A-unit be activated for at least one stimulus in W,
and let each stimulus activate a disjoint set of A-units. Let the classification
function C(W) be one which assigns every stimulus to the same class, either
positive or negative. A solution clearly exists, if the values of all connections
are positive (or negative, as required by the classification). But if the initial
state of the system is one in which all values are zero, or of the wrong sign, a
solution can never be achieved by the gamma system, since a solution requires
that the total value of each set $A_i$ of units responding to $S_i$, and
consequently the total value over the entire A-set, should agree in sign
with the classification. In the gamma system this is impossible, since the
sum of the values remains constant at its initial level. The conservative
property of the gamma system gives it one degree of freedom less than the
alpha system, making it impossible to achieve a solution to such problems
unless at least one surplus A-unit (which does not respond to any stimuli) exists.
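The conservation argument can be illustrated with exact arithmetic. The sketch below assumes a particular form of the gamma rule, in which each active unit gains `eta` and every unit then loses an equal share so that the total is unchanged; this form is chosen to agree with the G-matrix used in Theorem 10 and is an assumption here, as are the function names.

```python
from fractions import Fraction


def alpha_update(v, active, eta):
    """Alpha system: only the active units change."""
    return [w + (eta if k in active else 0) for k, w in enumerate(v)]


def gamma_update(v, active, eta):
    """Assumed gamma rule: active units gain eta, then all units give up an
    equal share, so that the sum of the values is conserved."""
    share = Fraction(len(active), len(v))
    return [w + eta * ((1 if k in active else 0) - share)
            for k, w in enumerate(v)]


def train(stimuli, classes, n_units, update, passes=50):
    """Error correction from zero initial values."""
    v = [Fraction(0)] * n_units
    for _ in range(passes):
        for active, c in zip(stimuli, classes):
            u = sum(v[k] for k in active)
            if u * c <= 0:
                v = update(v, active, c)
    return v


def is_solution(stimuli, classes, v):
    return all(sum(v[k] for k in active) * c > 0
               for active, c in zip(stimuli, classes))


stimuli = [{0, 1}, {2, 3}]     # disjoint sets covering all four A-units
classes = [+1, +1]             # every stimulus in the positive class

v_alpha = train(stimuli, classes, 4, alpha_update)
v_gamma = train(stimuli, classes, 4, gamma_update)
v_gamma_surplus = train(stimuli, classes, 5, gamma_update)   # unit 4 is surplus

print(is_solution(stimuli, classes, v_alpha))          # True
print(is_solution(stimuli, classes, v_gamma))          # False
print(is_solution(stimuli, classes, v_gamma_surplus))  # True
```

With four units the gamma system cycles back to the all-zero state on every pass, since the two signals must always sum to zero. The surplus fifth unit absorbs the negative value and frees the others to become positive.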


The two remaining theorems were proposed by Joseph (Ref. 42),
and establish useful diagnostic procedures for determining the existence of
solutions in both alpha and gamma system perceptrons. As in Theorem 3,
the activity function of the A-unit $a_i$ is defined as

$$a_i^*(S_j) = \begin{cases} 1 & \text{if } a_i \text{ is active for } S_j \\ 0 & \text{otherwise} \end{cases}$$

For any $n$-vector, $X$, with components $x_j$, the bias number of $a_i$
with respect to $X$ is defined as

$$b_i(X) = \sum_{j=1}^{n} x_j \, a_i^*(S_j)$$

This quantity is clearly related to the bias ratio (defined in 5.4) if $X$ is
taken to be the class-assignment vector for the $n$ stimuli. We will denote
by $X^*$ any $n$-vector $X$ whose components $x_i$ do not disagree in sign with
the required classification, C(W), i.e., $x_i \ge 0$ if $S_i$ is in the positive
class, and $x_i \le 0$ if $S_i$ is in the negative class. $X^\#$ will denote a
vector in which the inequalities are strict (no zero components).

THEOREM 9: Given an $\alpha$-perceptron, and a classification C(W), a
necessary and sufficient condition that the error correction
procedure reach a solution (in finite time, with arbitrary
starting point) is that there exists no non-zero $X^*$ such
that $b_i(X^*) = 0$ for all $i$.


PROOF: For convenience, an un-normalized G-matrix will be assumed.
For such a matrix,

$$g_{jk} = n_{jk} = \sum_i a_i^*(S_j) \, a_i^*(S_k)$$

where $n_{jk}$ is the number of A-units in the set responding to both $S_j$ and $S_k$.
Hence, for any $n$-vector $X$,

$$X'GX = \sum_{j,k} x_j \, g_{jk} \, x_k = \sum_{i,j,k} x_j x_k \, a_i^*(S_j) \, a_i^*(S_k)$$

But

$$\sum_i b_i^2(X) = \sum_i \Big[ \sum_j x_j \, a_i^*(S_j) \Big]^2 = \sum_{i,j,k} x_j x_k \, a_i^*(S_j) \, a_i^*(S_k)$$

Hence $X'GX = \sum_i b_i^2(X)$.

If the condition of the theorem holds, then $X'GX \ne 0$ for
$X = X^*$, $X \ne 0$. But from the proof of Theorem 4, it can be shown
that $X'GX \ge a|X|^2$ for $X = X^*$, where $a > 0$. Then the proof of the
correction procedure in Theorem 4 applies, and a solution exists, so that
the stated condition must be sufficient.

If the condition does not hold, then there is a non-zero $X^*$
such that $X'GX = 0$. Since G is positive semidefinite, this implies that
$X'G = 0$. Thus, $X^*$ is orthogonal to all the columns of G, and hence
to any linear combination of the columns of G. Since for an arbitrary
vector $Z$, $GZ$ is a linear combination of the columns of G, $GZ$ is
orthogonal to $X^*$. $X^*$ cannot be orthogonal to any vector $U$ in which
the signs of all $u_i$ agree with C(W), and hence it follows that there cannot
exist vectors $Z$ and $U$ such that $GZ = U$. This means that there
exists no solution to the classification problem, so the condition given must
be necessary. Q.E.D.
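The identity $X'GX = \sum_i b_i^2(X)$ at the heart of this proof is easy to verify numerically. The sketch below uses an arbitrary illustrative activity matrix (rows are A-units, columns are stimuli); the matrices and helper names are hypothetical, not taken from the text.

```python
def bias_numbers(activity, x):
    """b_i(X) = sum_j x_j * a_i(S_j); activity[i][j] is 1 if a_i responds to S_j."""
    return [sum(xj * a for xj, a in zip(x, row)) for row in activity]


def g_matrix(activity):
    """Un-normalized alpha-system G: g_jk = number of A-units common to S_j, S_k."""
    n = len(activity[0])
    return [[sum(row[j] * row[k] for row in activity) for k in range(n)]
            for j in range(n)]


def quad_form(g, x):
    n = len(x)
    return sum(x[j] * g[j][k] * x[k] for j in range(n) for k in range(n))


# Four A-units, three stimuli (illustrative values only).
activity = [[1, 0, 1],
            [1, 1, 0],
            [0, 1, 1],
            [1, 1, 1]]
G = g_matrix(activity)

x = [1, -1, 2]
print(bias_numbers(activity, x))                   # [3, 0, 1, 2]
print(quad_form(G, x))                             # 14
print(sum(b * b for b in bias_numbers(activity, x)))   # 14

# Two stimuli with identical A-sets but opposite classes: X* = (1, -1)
# annihilates every bias number, so by Theorem 9 no solution exists.
duplicate = [[1, 1],
             [1, 1],
             [0, 0]]
print(bias_numbers(duplicate, [1, -1]))            # [0, 0, 0]
print(quad_form(g_matrix(duplicate), [1, -1]))     # 0
```

The second case shows the failure condition directly: when two oppositely classified stimuli are indistinguishable at the A-unit level, the quadratic form vanishes for a vector in the orthant of C(W).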


COROLLARY: For an $\alpha$-system, the condition that there exist no
non-zero vector $X^*$ such that $b_i(X^*) = 0$ for all $i$
is equivalent to the condition that there exist $Z$ and
$U$ such that $GZ = U$ (where $U$ is in the same orthant
as C(W)). Alternatively, this condition is equivalent
to $X'GX \ne 0$ for all non-zero $X^*$.

THEOREM 10: Given a $\gamma$-perceptron, and a classification C(W), a
necessary and sufficient condition that the error correction
procedure reach a solution (in finite time) is that there
exists no non-zero $X^*$ such that $b_i(X^*) = c$
for all $i$.


PROOF: For the $\gamma$-system, the normalized G matrix consists of
elements

$$g_{jk} = \sum_i a_i^*(S_j) \, a_i^*(S_k) - \frac{1}{N_a} \sum_{i,h} a_i^*(S_j) \, a_h^*(S_k)$$

It is readily seen that G is symmetric. For any $n$-vector $X$, $X'GX$
is given by

$$X'GX = \sum_{j,k} x_j x_k \, g_{jk} = \sum_{i,j,k} x_j x_k \, a_i^*(S_j) \, a_i^*(S_k) - \frac{1}{N_a} \sum_{h,i,j,k} x_j x_k \, a_i^*(S_j) \, a_h^*(S_k)$$

We now define $\bar{b}(X)$ as

$$\bar{b}(X) = \frac{1}{N_a} \sum_i b_i(X)$$

From this, we see that

$$\sum_i \big[ b_i(X) - \bar{b}(X) \big]^2 = \sum_i b_i^2(X) - N_a \bar{b}^2(X) = \sum_{i,j,k} x_j x_k \, a_i^*(S_j) \, a_i^*(S_k) - \frac{1}{N_a} \sum_{h,i,j,k} x_j x_k \, a_i^*(S_j) \, a_h^*(S_k)$$

Hence $X'GX = \sum_i \big[ b_i(X) - \bar{b}(X) \big]^2$.

From this it follows, first of all, that G is positive definite or positive
semidefinite, as was the case for the $\alpha$-system. Secondly, it is seen
that $X'GX = 0$ if and only if $b_i(X) = c$ for all $i$. The proof now
proceeds exactly as in Theorem 9.
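The corresponding identity for the gamma system, $X'GX = \sum_i [b_i(X) - \bar{b}(X)]^2$, can be checked in the same way. This sketch assumes the form $g_{jk} = n_{jk} - n_j n_k / N_a$ for the gamma-system matrix, where $n_j$ is the number of A-units responding to $S_j$; the activity matrices are illustrative only and the helper names are hypothetical.

```python
from fractions import Fraction


def bias_numbers(activity, x):
    """b_i(X) = sum_j x_j * a_i(S_j), with rows of `activity` being A-units."""
    return [sum(xj * a for xj, a in zip(x, row)) for row in activity]


def gamma_g_matrix(activity):
    """Assumed gamma-system G: g_jk = n_jk - n_j * n_k / N_a."""
    n_a = len(activity)
    n = len(activity[0])
    col = [sum(row[j] for row in activity) for j in range(n)]
    return [[sum(row[j] * row[k] for row in activity)
             - Fraction(col[j] * col[k], n_a)
             for k in range(n)] for j in range(n)]


def quad_form(g, x):
    n = len(x)
    return sum(x[j] * g[j][k] * x[k] for j in range(n) for k in range(n))


def centered_sum_of_squares(activity, x):
    b = bias_numbers(activity, x)
    mean = Fraction(sum(b), len(b))
    return sum((bi - mean) ** 2 for bi in b)


activity = [[1, 0, 1],
            [1, 1, 0],
            [0, 1, 1],
            [1, 1, 1]]
G = gamma_g_matrix(activity)
print(quad_form(G, [1, -1, 2]), centered_sum_of_squares(activity, [1, -1, 2]))  # 5 5

# Disjoint A-sets, both stimuli in the same class: every b_i equals the same
# constant c = 1, so the quadratic form vanishes (the situation of Theorem 8).
disjoint = [[1, 0],
            [0, 1]]
print(quad_form(gamma_g_matrix(disjoint), [1, 1]))   # 0
```

The second case connects Theorem 10 back to Theorem 8: a constant, non-zero set of bias numbers blocks the gamma system even though an alpha system would succeed.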


COROLLARY: For a $\gamma$-system, the condition that there exists no
non-zero vector $X^*$ such that $b_i(X^*) = c$ for
all $i$ is equivalent to the condition that there exist $Z$
and $U$ such that $GZ = U$ (where $U$ is in the same
orthant as C(W)).

In practice, it is often possible to show that a given perceptron
does not permit a solution to a given classification problem by substituting
the classification vector itself, C(W), for the vector $X^*$ in the above
theorems, and computing the $b_i$. If these turn out to be zero for all
A-units, then no solution exists for either the alpha or gamma system. If
they are a constant other than zero, a solution may exist for the alpha
system, but not for the gamma system. If they are not all identical, then
a solution may exist for either system. While it is sufficient to take the
components of $X^*$ to be integers, the vector with all components $x_i = \pm 1$


is not always sufficient. For example, if the ag(S;) matrix is {1 1 1 
11 1 
1 12 1 


the $b_i$ will all be annihilated by $X = (1, -2, 1)$, but not by $X = (1, -1, 1)$.
The condition for the $\alpha$-system is equivalent to the requirement that there
should be no vector in the same orthant as C(W) which is orthogonal to the
linear manifold spanned by the activity vectors of the A-units.


6. Q-FUNCTIONS AND BIAS RATIOS IN ELEMENTARY PERCEPTRONS


Thus far, we have been mainly concerned with the general
"qualitative" properties of elementary perceptrons. In the present chapter,
the groundwork for a quantitative analysis of their performance will be
presented. In the theorems of Chapter 5, it was shown that the existence
and attainability of solutions, in an elementary perceptron, depends strongly
on the properties of the G-matrix. Each element of this matrix, $g_{ij}$,
is a measure of the generalization of reinforcement from stimulus $S_j$ to $S_i$.
This generalization coefficient, $g_{ij}$, varies with the measure of the set of
A-units which respond jointly to $S_i$ and $S_j$. Until now, the actual
quantitative measures of these sets have not been taken into consideration,
and only the formal properties of the matrix G have been considered. The
Q-functions, which are introduced in this chapter, represent the probabilities
that an A-unit in a specified class of perceptrons will respond to a
particular stimulus, or will respond jointly to a designated set of stimuli.
These Q-functions not only determine the expected values of the generalization
coefficients, $g_{ij}$, but enter into the analysis of variability of
perceptron performance as well, as will be seen in the following chapter.

6.1 Definitions and Notation


The Q-functions, defined below, are always specific to a
particular class of perceptrons in which the origin point configurations of
the A-units have been selected according to some designated set of rules
from a specified S-set or retina. The functions Q are defined only for
simple A-units, $a_i$, which are said to be active if the algebraic sums
of their input signals, $\alpha_i$, are equal to or greater than their thresholds,
$\theta_i$. For such A-units, Q represents the probability of drawing an
A-unit at random from the specified distribution which responds to each of
a specified set of stimuli. The notation employed is as follows:

$Q_i$ = probability that an A-unit in a specified class of
perceptrons responds to stimulus $S_i$.

$Q_{ij}$ = probability that an A-unit in a specified class of
perceptrons responds to stimulus $S_i$ and also to
stimulus $S_j$.

$Q_{ij \ldots m}$ = probability that an A-unit in a specified class of
perceptrons responds to each of the stimuli $S_i, S_j, \ldots, S_m$.


6.2 Models to be Analyzed 


Three types of models will be considered which differ in the 
rules by which connections are made between S-units and A-units. It turns 
out that for the three cases, the distribution of input signals to the A-units 
is expressed in terms of binomial, Poisson, and normal random variables, 
respectively. These models are therefore named binomial, Poisson, and
Gaussian models.
6.2.1 Binomial Models

In a binomial model the input signal, $\alpha_i$, received by
unit $a_i$, is distributed as the difference of two binomially distributed
random variables. This model characterizes a type of perceptron in which
each A-unit receives a fixed number of connections from the "retina",
consisting of exactly $x$ "excitatory" and $y$ "inhibitory" connections. Each
of the excitatory connections has the value +1, and each inhibitory connection
has the value -1. The threshold, $\theta$, is assumed to be fixed for all A-units.
The origins of the connections to an A-unit are selected independently, with
uniform probability, from the entire set of S-units (or retinal points).
Specifically, a set of equiprobable origin configurations can be constructed
as follows: Let there be $\nu$ connections, numbered from 1 to $\nu$. Let the
S-units be numbered from 1 to $N_s$. Then the set of all possible sequences
of $\nu$ integers, each having a value in the range $1 \le n \le N_s$, corresponds
to the complete set of A-units. In this model, the number of distinguishable
A-units possible for a retina of $N_s$ points is

$$\binom{N_s + x - 1}{x} \binom{N_s + y - 1}{y}$$

Figure 6 ILLUSTRATION OF TYPICAL S TO A-UNIT CONNECTIONS (ARROWHEADS
INDICATE RANDOMLY SELECTED TERMINATIONS). IN GAUSSIAN MODELS,
THE VALUES OF THE CONNECTIONS (SHOWN HERE AS ±1) ARE NORMAL
RANDOM VARIABLES.

In the binomial model, Q-functions do not depend on the number
of sensory units, but on the fraction of them which are illuminated. A variation
of this model has been analyzed in Ref. 79, where the additional constraint is
introduced that no two connections to a single A-unit can originate from the
same S-unit. It has been shown that for moderately large numbers of S-units,
this model is practically indistinguishable from the true binomial model
described above.
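The count of distinguishable A-units can be checked by brute force for a small retina. The sketch below assumes that two A-units are distinguishable only by how many excitatory and how many inhibitory connections they receive from each S-unit (so the order of the connections is irrelevant), which leads to the standard count of combinations with repetition. The function names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def count_by_enumeration(n_s, x, y):
    """Enumerate unordered origin selections (with repetition allowed)
    for the x excitatory and y inhibitory connections separately."""
    excitatory = sum(1 for _ in combinations_with_replacement(range(n_s), x))
    inhibitory = sum(1 for _ in combinations_with_replacement(range(n_s), y))
    return excitatory * inhibitory


def count_by_formula(n_s, x, y):
    """C(N_s + x - 1, x) * C(N_s + y - 1, y)."""
    return comb(n_s + x - 1, x) * comb(n_s + y - 1, y)


print(count_by_enumeration(4, 3, 2), count_by_formula(4, 3, 2))   # 200 200
```

Even for modest retinas the count grows quickly, which is why the analysis that follows works with probabilities over the class of A-units instead of listing them.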
The derivation of the formula for the number of distinguishable A-units can
be found in Feller, Ref. 21, page 52.

6.2.2 Poisson Models


In a Poisson model, $\alpha_i$ is distributed as the difference of
two Poisson-distributed random variables. In this model, it is assumed
that the number of input connections to an A-unit is not fixed, but is a
random variable. The model corresponds to one of two situations, the
equations for the Q-functions being identical for both:

(1) In the constrained origin model, each S-unit emits a fixed number of
output connections, consisting of $\nu_x$ excitatory and $\nu_y$ inhibitory connections
(with values +1 and -1, respectively). Terminal points are selected at random
from a set of $N_a$ A-units. For the model to hold exactly, $N_s$ and $N_a$
should both be infinite, the ratio $N_s / N_a$ being a parameter of the system.
For finite $N_s$ and $N_a$, the model remains a close approximation.

(2) In the random origin model, a set of $N_x$ excitatory and $N_y$ inhibitory
connections are each independently assigned an origin and a terminus at
random, from a set of S-units and A-units, with uniform probabilities. In
this case, for the model to hold exactly, the numbers $N_x$, $N_y$ and $N_a$
should all be infinite, with their ratios being parameters of the system;
as in the previous case, however, the model is a close approximation for
finite systems.


In the Poisson model, for Case (1), the number of possible A-units
is $(\nu_x + 1)^{N_s} (\nu_y + 1)^{N_s}$. For Case (2), the number of
possible A-units is $(N_x + 1)^{N_s} (N_y + 1)^{N_s}$. The binomial model, the
constrained-origin Poisson model, and the random-origin Poisson model
yield increasingly large sets of possible A-units, for the same numbers of
S-units, A-units, and connections.
6.2.3 Gaussian Models


In the Gaussian case, $\alpha_i$ is distributed as the difference
of two normally distributed random variables, i.e., $\alpha_i$ is normally
distributed. While both of the above cases converge to a Gaussian model
as the number of input connections to an A-unit becomes large, we shall
be concerned here with a model in which the number of connections remains
finite, but the values of the connections are normally distributed.

6.3 Analysis of $Q_i$


For both the binomial and Poisson models, $Q_i$, the probability
that an A-unit is activated by stimulus $S_i$, is given by the probability that
the total input signal $\alpha$ is equal to or greater than the threshold, $\theta$.
Specifically,

$$Q_i = \sum_{\alpha \ge \theta} P(\alpha) = \sum_{E - I \ge \theta} P_x(E) \, P_y(I) = \sum_{E=\theta}^{E_{max}} \sum_{I=0}^{E-\theta} P_x(E) \, P_y(I) \tag{6.1}$$

where

$E_{max}$ = $x$ for the binomial model, $\infty$ for the Poisson model

$P_x(E)$ = probability that exactly $E$ of the excitatory connections
to an A-unit originate from active S-points.

$P_y(I)$ = probability that exactly $I$ of the inhibitory connections
to an A-unit originate from active S-points.

For the binomial model,

$$P_x(E) = \binom{x}{E} R_i^{E} (1 - R_i)^{x - E}, \qquad P_y(I) = \binom{y}{I} R_i^{I} (1 - R_i)^{y - I} \tag{6.2}$$

where $R_i$ = fraction of retinal points (S-units) activated by stimulus $S_i$.
For the Poisson model,

$$P_x(E) = \frac{(R_i x)^{E}}{E!} e^{-R_i x}, \qquad P_y(I) = \frac{(R_i y)^{I}}{I!} e^{-R_i y} \tag{6.3}$$

where

$x$ = expected number of excitatory input connections to an A-unit.

$y$ = expected number of inhibitory input connections to an A-unit.

$P(\alpha)$ for the Poisson model can be expressed alternatively by
the following identity (pointed out by Prof. H. D. Block):

$$P(\alpha) = P(E - I = \alpha) = e^{-R_i (x + y)} \left( \frac{x}{y} \right)^{\alpha / 2} I_\alpha \big( 2 R_i \sqrt{xy} \big)$$

where $I_\alpha(z)$ is a Bessel function of an imaginary argument, given by

$$I_\alpha(z) = \sum_{k=0}^{\infty} \frac{(z/2)^{2k + \alpha}}{k! \, (k + \alpha)!}$$

The use of this equation makes it possible to compute Q-functions
for the Poisson model by hand, with the aid of tables of Bessel functions (cf.
Ref. 37, pp. 224-233).
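The Bessel-function form of $P(\alpha)$ can be compared against the direct convolution of the two Poisson distributions. The sketch below is a numerical check only: it takes $I_\alpha(z)$ to be the modified Bessel function of the first kind, evaluated from its power series with a fixed number of terms, and uses $I_{-\alpha} = I_\alpha$ for negative integer $\alpha$. The parameter values and function names are arbitrary illustrations.

```python
from math import exp, factorial, sqrt


def poisson_pmf(k, mean):
    return exp(-mean) * mean ** k / factorial(k)


def p_alpha_direct(alpha, r, x, y, terms=60):
    """P(E - I = alpha) by summing P_x(I + alpha) * P_y(I) over I."""
    total = 0.0
    for i in range(terms):
        e = i + alpha
        if e >= 0:
            total += poisson_pmf(e, r * x) * poisson_pmf(i, r * y)
    return total


def bessel_i(order, z, terms=60):
    """Modified Bessel function of integer order, from its power series."""
    order = abs(order)
    return sum((z / 2) ** (2 * k + order) / (factorial(k) * factorial(k + order))
               for k in range(terms))


def p_alpha_bessel(alpha, r, x, y):
    return (exp(-r * (x + y)) * (x / y) ** (alpha / 2)
            * bessel_i(alpha, 2 * r * sqrt(x * y)))


r, x, y = 0.3, 5.0, 3.0
for alpha in range(-3, 6):
    print(alpha, p_alpha_direct(alpha, r, x, y), p_alpha_bessel(alpha, r, x, y))
```

The two columns agree to within floating-point rounding, and summing either over $\alpha \ge \theta$ gives $Q_i$ for the Poisson model.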


For the Gaussian model, equation (6.1) requires an additional
factor representing the distribution of value for each of the connections.
Specifically, if the absolute values of both excitatory and inhibitory connections
are distributed with mean $\mu$ and standard deviation $\sigma$, we have

$$Q_i = \sum_{E \ge 0} \sum_{I \ge 0} P_x(E) \, P_y(I) \int_{D=\theta}^{\infty} \phi(\mu_D, \sigma_D) \, dD \tag{6.4}$$

where $\phi(\mu_D, \sigma_D)$ is the normal density of the total signal $D$, with

$$\mu_D = E\mu - I\mu$$

$$\sigma_D^2 = (E + I) \sigma^2$$

$P_x(E)$ and $P_y(I)$ in equation (6.4) are given either by (6.2) or (6.3),
depending on whether the number of input connections to an A-unit is fixed
(as in the binomial model) or random (as in the Poisson model).

Figures 7 and 8 show representative families of curves for $Q_i$
as a function of $R_i$, for the binomial and Poisson models, respectively.
Note that both models are very similar in their basic characteristics.


Specifically:

1. In all cases, for $R_i < .5$ and $x \ge y$, $Q_i$ increases monotonically
with $R_i$.

2. For purely excitatory models ($y = 0$), $Q_i$ goes to 1.0 as $R_i$
approaches 1.0. (Figures 7a and 8a).

3. For models with $\theta > x - y$, $Q_i$ goes to zero as $R_i$ approaches 1.0.
(Figures 7b and 8b).

4. For $x = y$, $Q_i$ tends to remain invariant except for very small or
very large values of $R_i$. The range over which $Q_i$ tends to
remain constant is increased if the number of connections becomes
large (Figs. 7c and 8c). In the limit, with small $\theta$ and large $x$
and $y$, $Q_i$ approaches .5 for all values of $R_i$ except 0 and 1.

5. Keeping $x$ fixed, then for small $\theta$, $Q_i$ is generally greater
for the binomial model than for the Poisson model. For large $\theta$, $Q_i$
is greater for the Poisson model.

6. For the binomial model, $Q_i = 0$ for $x < \theta$, while for the Poisson
model, $Q_i = 0$ only if $x = 0$.
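Several of the properties listed above can be reproduced directly from equations (6.1) to (6.3). The following is a minimal sketch: the Poisson sums are truncated at an arbitrary cutoff `e_max`, the parameter values are illustrative, and the function names are not from the text.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, mean):
    return exp(-mean) * mean ** k / factorial(k)


def q_i_binomial(r, x, y, theta):
    """Equation (6.1) with the binomial probabilities (6.2)."""
    return sum(binomial_pmf(e, x, r) * binomial_pmf(i, y, r)
               for e in range(x + 1) for i in range(y + 1)
               if e - i >= theta)


def q_i_poisson(r, x, y, theta, e_max=80):
    """Equation (6.1) with the Poisson probabilities (6.3), truncated."""
    return sum(poisson_pmf(e, r * x) * poisson_pmf(i, r * y)
               for e in range(e_max) for i in range(e_max)
               if e - i >= theta)


# Property 2: purely excitatory binomial model, R_i = 1.
print(q_i_binomial(1.0, 5, 0, 2))        # 1.0
# Property 3: theta > x - y, R_i = 1.
print(q_i_binomial(1.0, 5, 5, 2))        # 0.0
# Property 6: x < theta gives zero for the binomial model only.
print(q_i_binomial(0.3, 2, 1, 3), q_i_poisson(0.3, 2, 1, 3) > 0)   # 0 True

# Property 1: Q_i rises with R_i for R_i < .5 and x >= y (binomial case).
curve = [q_i_binomial(0.05 * k, 5, 2, 2) for k in range(1, 10)]
print(all(a < b for a, b in zip(curve, curve[1:])))   # True
```

The same two functions, evaluated over a grid of $R_i$, regenerate curves of the kind the text attributes to Figures 7 and 8.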


6.4 Analysis of $Q_{ij}$


$Q_{ij}$ is the probability that an A-unit is activated by each of
two stimuli, $S_i$ and $S_j$. For both the binomial and Poisson models, $Q_{ij}$
can be expressed by the equation:

$$Q_{ij} = \sum_{\substack{E_i + E_c - I_i - I_c \ge \theta \\ E_j + E_c - I_j - I_c \ge \theta}} P_x(E_i, E_j, E_c) \, P_y(I_i, I_j, I_c) \tag{6.5}$$

where $\theta$ = threshold of A-units

$E_i$ = number of excitatory connections originating from points
illuminated by $S_i$ but not by $S_j$

$E_j$ = number of excitatory connections originating from points
illuminated by $S_j$ but not by $S_i$

$E_c$ = number of excitatory connections originating from points
common to $S_i$ and $S_j$

$I_i$ = number of inhibitory connections originating from points
illuminated by $S_i$ but not by $S_j$

$I_j$ = number of inhibitory connections originating from points
illuminated by $S_j$ but not by $S_i$

$I_c$ = number of inhibitory connections originating from points
common to $S_i$ and $S_j$

The point sets involved in the analysis of $Q_{ij}$ are illustrated in Figure 9.
For the binomial model, the required probabilities are given by the
multinomial equations:

$$P_x(E_i, E_j, E_c) = \frac{x!}{E_i! \, E_j! \, E_c! \, (x - E_i - E_j - E_c)!} A_i^{E_i} A_j^{E_j} C^{E_c} (1 - A_i - A_j - C)^{x - E_i - E_j - E_c}$$

$$P_y(I_i, I_j, I_c) = \frac{y!}{I_i! \, I_j! \, I_c! \, (y - I_i - I_j - I_c)!} A_i^{I_i} A_j^{I_j} C^{I_c} (1 - A_i - A_j - C)^{y - I_i - I_j - I_c} \tag{6.6}$$

where $C$ = proportion of retinal points illuminated both by $S_i$ and $S_j$

$A_i = R_i - C$, where $R_i$ is the proportion of retinal points illuminated
by $S_i$

$A_j = R_j - C$, where $R_j$ is the proportion of retinal points illuminated
by $S_j$

For the Poisson model (where $x$ and $y$ are the expected numbers of
excitatory and inhibitory connections to an A-unit),

$$P_x(E_i, E_j, E_c) = (E_i! \, E_j! \, E_c!)^{-1} \, e^{-xA_i} (xA_i)^{E_i} \, e^{-xA_j} (xA_j)^{E_j} \, e^{-xC} (xC)^{E_c}$$

$$P_y(I_i, I_j, I_c) = (I_i! \, I_j! \, I_c!)^{-1} \, e^{-yA_i} (yA_i)^{I_i} \, e^{-yA_j} (yA_j)^{I_j} \, e^{-yC} (yC)^{I_c} \tag{6.7}$$


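As in the case of $Q_i$, the Gaussian model for $Q_{ij}$ requires
an additional factor representing the normal distribution of connection values.
The components of the input signal, $\alpha$, which originate from the unique
S-units in $S_i$, the unique points in $S_j$, and from the common retinal
set are designated $D_i$, $D_j$, and $D_c$, respectively. By analogy to
(6.4), for $k = i, j, c$:

$$\mu_{D_k} = \mu (E_k - I_k)$$

$$\sigma_{D_k}^2 = (E_k + I_k) \sigma^2$$

$$\phi(D_k) = \phi(\mu_{D_k}, \sigma_{D_k}), \text{ defined as in (6.4).}$$

$$Q_{ij} = \sum_{\{E_i, E_j, E_c, I_i, I_j, I_c\}} P_x(E_i, E_j, E_c) \, P_y(I_i, I_j, I_c) \int_{D_c = -\infty}^{\infty} \int_{D_i = \theta - D_c}^{\infty} \int_{D_j = \theta - D_c}^{\infty} \phi(D_c) \, \phi(D_i) \, \phi(D_j) \, dD_j \, dD_i \, dD_c \tag{6.8}$$

For some purposes, the distribution of the input signals, $\alpha_i$
and $\alpha_j$, is of interest. The joint probability, $P(\alpha_i, \alpha_j)$, is given by

$$P(\alpha_i, \alpha_j) = \sum_{\{E_i, E_j, E_c, I_i, I_j, I_c\}} P_x(E_i, E_j, E_c) \, P_y(I_i, I_j, I_c) \int_{D_c = -\infty}^{\infty} \phi(D_c) \, \phi(\alpha_i - D_c) \, \phi(\alpha_j - D_c) \, dD_c \tag{6.9}$$

where $\phi(\alpha_i - D_c)$ and $\phi(\alpha_j - D_c)$ denote the densities of $D_i$ and $D_j$
evaluated at $\alpha_i - D_c$ and $\alpha_j - D_c$, respectively.

It should be noted that $Q_i = Q_{ii}$ is a special case of these equations, for
which $A_i = A_j = 0$ (so that $C = R_i$). Tables of $Q_{ij}$ for binomial and
Poisson models have been published in Ref. 87.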
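For small numbers of connections, equation (6.5) with the multinomial probabilities (6.6) can be evaluated by direct enumeration. The sketch below is an illustration under the binomial model only; the parameter values are arbitrary and the helper names (`triples`, `multinomial_pmf`, and so on) are hypothetical.

```python
from math import comb, factorial


def multinomial_pmf(counts, n, probs):
    """Probability of the given counts in three regions, with the remaining
    n - sum(counts) connections falling outside both stimuli (eq. 6.6)."""
    rest = n - sum(counts)
    coef = factorial(n) // factorial(rest)
    for c in counts:
        coef //= factorial(c)
    value = coef * (1.0 - sum(probs)) ** rest
    for c, p in zip(counts, probs):
        value *= p ** c
    return value


def triples(n):
    """All (a, b, c) of non-negative integers with a + b + c <= n."""
    for a in range(n + 1):
        for b in range(n + 1 - a):
            for c in range(n + 1 - a - b):
                yield a, b, c


def q_ij_binomial(r_i, r_j, c, x, y, theta):
    """Equation (6.5): both input signals must reach the threshold."""
    probs = (r_i - c, r_j - c, c)          # A_i, A_j, C
    total = 0.0
    for e in triples(x):
        p_e = multinomial_pmf(e, x, probs)
        for i in triples(y):
            if (e[0] + e[2] - i[0] - i[2] >= theta
                    and e[1] + e[2] - i[1] - i[2] >= theta):
                total += p_e * multinomial_pmf(i, y, probs)
    return total


def q_i_binomial(r, x, y, theta):
    """Single-stimulus probability from (6.1) and (6.2), for comparison."""
    pmf = lambda k, n: comb(n, k) * r ** k * (1 - r) ** (n - k)
    return sum(pmf(e, x) * pmf(i, y)
               for e in range(x + 1) for i in range(y + 1) if e - i >= theta)


q_single = q_i_binomial(0.3, 5, 5, 2)
q_identical = q_ij_binomial(0.3, 0.3, 0.3, 5, 5, 2)   # A_i = A_j = 0, C = R
q_partial = q_ij_binomial(0.3, 0.3, 0.1, 5, 5, 2)
print(q_single, q_identical, q_partial)
```

When the two stimuli coincide, the joint probability collapses to $Q_i$, and for partial overlap it can never exceed $Q_i$.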


Figures 10 and 11 illustrate the quantitative properties of $Q_{ij}$
as a function of $C$, the measure of the intersection of stimuli $S_i$ and $S_j$
on the "retina". For convenience of representation, $Q_{ij}$ is actually plotted
as a function of the relative intersection (or proportional intersection), $C/R$,
$R_i$ and $R_j$ being equal for all cases shown. Note that for $C/R = 1$,
$Q_{ij} = Q_{ii} = Q_i$. The main features of these curves are:

1. In all cases, $Q_{ij}$ increases monotonically with $C$.

2. For large $\theta$, $Q_{ij}$ tends to remain close to zero, except for
stimuli which approach perfect identity ($C/R$ close to 1.0).


3. For large values of f , Q@;; tends to accelerate more rapidly 


as C approaches 1. 


4. For the binomial model, Q;; for disjoint or well separated stimuli 
( C =» 0 )may have a maximum with respect to 2 . This effect 


is not found in the Poisson model. (Figs. 10c and llc.) 


5. For equivalent parameters, $Q_{ij}$ tends to show a sharper "shoulder"
in the binomial model than the Poisson model.

The second of these properties is an important factor in
determining the discriminative capability of a perceptron. It is shown best
in terms of the conditional probability, $Q_{i|j}$, that an A-unit which responds
to $S_j$ also responds to $S_i$. $Q_{i|j}$ is equal to $Q_{ij} / Q_j$, and is shown for
several typical cases in Fig. 12. Note that for large values of $\theta$, the
probability that an A-unit responding to $S_j$ responds to a second stimulus,
$S_i$, is virtually zero, unless the stimuli approach perfect identity. The
difference between the binomial and Poisson models is shown most clearly
in Figures 12(a) and 12(b). Figure 12(c) demonstrates that the conditional
probability depends only slightly on stimulus size. Additional curves for
these functions can be found in Refs. 79 and 80.


In analyzing the gamma system, it will be seen that the
conditions under which $Q_{ij} = Q_i Q_j$ are of particular interest, since for
the gamma system the expected value of $g_{ij}$ is zero for such conditions.
In the binomial model, $Q_{ij} = Q_i Q_j$ if $C = R_i R_j$. This condition
will tend to be met if the stimuli are randomly chosen sets of S-points,
the expected intersection of any two such sets being equal to the product of
the measures of the sets. It can readily be seen that under these conditions,
the probability that an origin point which is in $S_i$ is also in $S_j$ is the same
as the probability that an origin point which is not in $S_i$ happens to be in $S_j$;
in other words, the probability that the origin of a connection is in $S_j$ does not
depend on whether or not it is in $S_i$, and consequently the response to $S_j$
is independent of the response to $S_i$, yielding $Q_{ij} = Q_i Q_j$. In the Poisson
model, however, $Q_{ij} = Q_i Q_j$ only if $C = 0$ (i.e., for disjoint stimuli) since
the connections received from any disjoint subset of S-units are independent
of connections (or signals) from any other subset.
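The independence condition for the binomial model can be confirmed by enumerating, for each connection, which region of the retina its origin falls in. This is a sketch for a small illustrative case ($x = 3$, $y = 2$, $\theta = 1$); the region coding and the function name `joint_q` are assumptions made for the example.

```python
from itertools import product


def joint_q(r_i, r_j, c, x, y, theta):
    """Return (Q_i, Q_j, Q_ij) for the binomial model by enumerating the
    region of every connection origin: 0 = only S_i, 1 = only S_j,
    2 = both stimuli, 3 = neither."""
    probs = (r_i - c, r_j - c, c, 1.0 - r_i - r_j + c)
    in_i = (1, 0, 1, 0)
    in_j = (0, 1, 1, 0)
    signs = [+1] * x + [-1] * y          # excitatory, then inhibitory
    q_i = q_j = q_ij = 0.0
    for regions in product(range(4), repeat=x + y):
        p = 1.0
        for reg in regions:
            p *= probs[reg]
        alpha_i = sum(s * in_i[reg] for s, reg in zip(signs, regions))
        alpha_j = sum(s * in_j[reg] for s, reg in zip(signs, regions))
        hit_i = alpha_i >= theta
        hit_j = alpha_j >= theta
        q_i += p * hit_i
        q_j += p * hit_j
        q_ij += p * (hit_i and hit_j)
    return q_i, q_j, q_ij


# C = R_i * R_j: responses to the two stimuli are independent.
qi, qj, qij = joint_q(0.4, 0.5, 0.4 * 0.5, x=3, y=2, theta=1)
print(qij, qi * qj)

# Larger overlap than the product of the measures: positive dependence.
qi2, qj2, qij2 = joint_q(0.4, 0.5, 0.3, x=3, y=2, theta=1)
print(qij2, qi2 * qj2)
```

In the first case the two printed values agree to rounding error; in the second the joint probability exceeds the product, which is the situation in which the gamma system's expected $g_{ij}$ is no longer zero.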


6.5 Analysis of $Q_{ijk}$


In the following chapter, it will be seen that the expected responses
of a simple perceptron can generally be determined from the functions $Q_i$
and $Q_{ij}$. The variability of performance in a class of perceptrons, however,
will be seen to depend on the joint probability, $Q_{ijk}$, that an A-unit
responds to each of three stimuli, $S_i$, $S_j$ and $S_k$. The equations are a
straightforward generalization of those employed in the last section for $Q_{ij}$.
Specifically, there are now seven excitatory and seven inhibitory signal
components to be considered:

$E_i$ = excitatory signal from S-units responding to $S_i$
but not to $S_j$ or $S_k$

$E_j$ = excitatory signal from S-units responding to $S_j$
but not to $S_i$ or $S_k$

$E_k$ = excitatory signal from S-units responding to $S_k$
but not to $S_i$ or $S_j$

$E_{ij}$ = excitatory signal from S-units responding to $S_i$
and $S_j$ but not $S_k$

$E_{ik}$ = excitatory signal from S-points responding to $S_i$
and $S_k$ but not $S_j$

$E_{jk}$ = excitatory signal from S-points responding to $S_j$
and $S_k$ but not $S_i$

$E_{ijk}$ = excitatory signal from S-points responding to all
three stimuli.

Inhibitory components are defined analogously. This yields the equation:

$$Q_{ijk} = \sum_{\substack{\alpha_i \ge \theta \\ \alpha_j \ge \theta \\ \alpha_k \ge \theta}} P_x(E_i, E_j, E_k, E_{ij}, E_{ik}, E_{jk}, E_{ijk}) \, P_y(I_i, I_j, I_k, I_{ij}, I_{ik}, I_{jk}, I_{ijk}) \tag{6.10}$$

where

$$\alpha_i = E_i + E_{ij} + E_{ik} + E_{ijk} - I_i - I_{ij} - I_{ik} - I_{ijk}$$

$$\alpha_j = E_j + E_{ij} + E_{jk} + E_{ijk} - I_j - I_{ij} - I_{jk} - I_{ijk}$$

$$\alpha_k = E_k + E_{ik} + E_{jk} + E_{ijk} - I_k - I_{ik} - I_{jk} - I_{ijk}$$

The multinomial and Poisson probabilities employed in (6.10) for the
binomial and Poisson models, respectively, are obtained by extension
of (6.6) and (6.7), with appropriate measures for the various double and
triple intersections among the stimuli.
6.6 Bias Ratios of A-units


Bias ratios were defined in Section 5.4 as the ratio of the
number of stimuli in the positive class to the number of stimuli in the
negative class, which activate an A-unit. In Theorem 2, it was shown
that there must be some variation in the bias ratios of the A-units in a
perceptron, if a solution to a given classification is to exist, and Theorems 9
and 10 showed that the closely related "bias numbers" yield necessary and
sufficient conditions for solutions. Clearly, the distribution of bias ratios
depends on the probabilities $Q_{ij \ldots m}$ that the A-units will respond to
various possible sets of stimuli, $S_i, S_j, \ldots, S_m$. Rather than undertake
a detailed analysis of bias ratios, empirical data are presented for a typical
case, to illustrate how we might expect the "responsiveness" of A-units to
different classes of stimuli to be distributed. These data were obtained by
a Monte Carlo procedure, in which 10,000 A-units were tested on a digital
computer to determine to how many stimuli of each class they responded. *


* The program was written by A. Geoffrion for the Burroughs 220
computer at Cornell University.
The "retina" consists of a 20 by 20 mosaic of S-units, and the stimuli
consist of 4 by 20 bars, placed vertically or horizontally on the retina, in all
possible positions. The retina is assumed to be toroidally connected, so
that bars placed near one edge of the field may re-enter at the opposite
edge. Thus, there are twenty possible horizontal bars (the positive class)
and twenty possible vertical bars (the negative class). This universe will
be used as a standard one in a number of learning experiments to be
analyzed in the following chapters.** Table 1 shows the number of A-units
out of 10,000 responding to each possible combination of $N^+$ horizontal bars
and $N^-$ vertical bars. An A-unit which responds to 4 horizontal and 6 vertical
bars, for example, is tallied in the 5th row and 7th column of the table. Each
A-unit had five excitatory and five inhibitory connections, and a threshold of 2.
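The Monte Carlo tally described above can be sketched in a few lines of Python. This is an illustrative reconstruction only, not the original Burroughs 220 program: all function names are hypothetical, connection origins are assumed to be drawn uniformly and independently over the retina (with replacement), and an A-unit is assumed to respond when its excitatory input minus its inhibitory input from the stimulus reaches the threshold.

```python
import random
from collections import Counter

SIZE, BAR = 20, 4  # 20 by 20 toroidal retina; bars are 4 points wide


def make_bars():
    """Return (horizontal, vertical): 20 bars each, as sets of (row, col) points."""
    horizontal = [{((t + d) % SIZE, c) for d in range(BAR) for c in range(SIZE)}
                  for t in range(SIZE)]
    vertical = [{(r, (t + d) % SIZE) for d in range(BAR) for r in range(SIZE)}
                for t in range(SIZE)]
    return horizontal, vertical


def random_unit(rng, x, y):
    """Origins of x excitatory and y inhibitory connections (assumed uniform)."""
    def point():
        return (rng.randrange(SIZE), rng.randrange(SIZE))
    return [point() for _ in range(x)], [point() for _ in range(y)]


def responds(unit, stimulus, theta):
    """True if excitatory minus inhibitory input from the stimulus reaches theta."""
    excitatory, inhibitory = unit
    alpha = (sum(p in stimulus for p in excitatory)
             - sum(p in stimulus for p in inhibitory))
    return alpha >= theta


def joint_distribution(n_units, x=5, y=5, theta=2, seed=0):
    """Tally units by (number of horizontal, number of vertical) bars responded to."""
    rng = random.Random(seed)
    horizontal, vertical = make_bars()
    tally = Counter()
    for _ in range(n_units):
        unit = random_unit(rng, x, y)
        n_pos = sum(responds(unit, s, theta) for s in horizontal)
        n_neg = sum(responds(unit, s, theta) for s in vertical)
        tally[(n_pos, n_neg)] += 1
    return tally
```

Calling `joint_distribution(10_000)` and printing `tally[(n_pos, n_neg)]` row by row gives a table laid out like Table 1. The counts will not match the published ones exactly, since the random numbers and sampling details of the original program are not known.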


For stimuli which are more similar to one another (in terms of
possible intersection of S-sets) than horizontal and vertical bars, we would
expect to find the A-units less well distributed, with a greater concentration
around the diagonal. One would also expect that in a universe in which the
stimulus classes are less symmetric in their properties, the distribution
of A-units would be less symmetric than that shown in Table 1. Table 2
illustrates both of these features. In this case, the "positive" class
consists of 4 by 20 horizontal bars, just as before; the "negative" class,
however, consists of a set of 6 by 20 horizontal bars. Again, there are
twenty members of each class, but the maximum intersection possible between
stimuli of the positive and negative class is much greater than before, and the
size difference introduces an asymmetry which was not previously present.


** The toroidal retina has the convenient property of being unbounded and
isotropic, with a finite surface. Any relations which hold for a set of
stimuli projected onto the retina hold equally well if all stimuli are
displaced by any combination of horizontal and vertical translations.
This model (with Born-von Kármán boundary conditions) is easier to
analyze than a spherical retina, which has similar properties.
TABLE 1


JOINT DISTRIBUTION OF 10,000 A-UNITS, WITH RESPECT TO NUMBERS OF
HORIZONTAL BARS AND NUMBERS OF VERTICAL BARS TO WHICH THEY RESPOND


$N^-$
(VERTICAL BARS)


0 1 2 3 4 5 6
0 830-287 326 0 ougtié8dD 37% 6s 97 

315 «(892 378 = s«876—ti(‘é«é 71 30 

gh 2 325 WI? 399 351 92 a7 

3 328382 308888 943 9 8430-37 

(HORIZONTAL BARS) =, 330 361 368 340 305 68 24 
5 68 87 79 a 85 27 a 

6 $2 36 38 97 26 ? 2 

7 6 9 7 ? 6 4 0 

8 2 0 2 | i 0 

TABLE 2


JOINT DISTRIBUTION OF 10,000 A-UNITS, WITH RESPECT TO NUMBERS OF
4 x 20 AND 6 x 20 HORIZONTAL BARS TO WHICH THEY RESPOND


$N^-$
(6 x 20 BARS)


0 1 2 3 4 5 6 7 8 9 10
0 917 436 224 47 i ! 0 0 0 0 0 
t 277, 72% = S807 8388 86 a4 § 0 o o 0 
2 98 250 $72 539 370 119 51 5 3 0 0 
Nt 3 16 63 191 622 534 424 166 17 § 2 0 
(% x 20 BARS) % ' 10 40 162 380 0=— «583 602 67 9 8 0 
§ 0 0 tt 23 50 133 158 59 22 5 i 
6 0 0 0 3 VI 22 a8 28 24 4 0 
7 0 0 0 0 0 3 | it 10 8 ! 
8 0 0 0 0 0 0 0 I ( | t 
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While the joint distributions illustrated here are not of great 
utility in analyzing perceptron performance, they provide considerable 
insight into what takes place within the association system when a perceptron 
learns a classification of stimuli. Units situated on the diagonal (i.e., units 
which respond equally to both classes of stimuli) are essentially "duds"; they
contribute little to a discrimination, and are as likely to be reinforced 
positively as negatively. A-units which have a strong bias towards one class 
or the other, however, (those situated in the upper right or lower left corners 
of the tables) are useful "discriminators". In learning a classification, the 
perceptron relies on combinations of such units, transmitting large-valued 


signals, to establish a bias towards the proper class when a stimulus appears. 
7. PERFORMANCE OF ELEMENTARY α-PERCEPTRONS IN
PSYCHOLOGICAL EXPERIMENTS


So far, only the formal properties of elementary perceptrons 
have been analyzed, without regard to particular experimental situations 
or procedures. We are now ready to begin a quantitative analysis of the 
performance of these systems in "psychological" experiments, i.e.,
experiments in which the procedures and observations are analogous to
those which might be performed on a biological organism. A number of
such experiments were defined in Part I, Section 3.3. In this chapter, we
shall be chiefly concerned with discrimination experiments (cf. Section 3.3.1),
since the capabilities of elementary perceptrons are largely limited to this
category. Before going on to other types of systems, however, we will
consider what kinds of behavior might be expected of an elementary
system in generalization experiments, figure detection experiments, and
other problems which were discussed in Chapter 3. The analysis of
discrimination experiments which is reported here is basically similar to
that which was originally presented in Ref. 79. The former models have
been substantially simplified, however, and the analysis has been made
more rigorous, thanks largely to the work of R. D. Joseph (Ref. 41).


7.1 Discrimination Experiments with S-controlled Reinforcement 


The first problem to be analyzed is that of a discrimination
experiment in which the perceptron is presented with a sequence of stimuli
from an environment, $W$, and is reinforced for each stimulus in the
sequence in accordance with a predetermined classification, $C(W)$, with
the reinforcement control constant, $\eta$, taking the sign of the required
response. The perceptron is then shown a test stimulus ($S_x$) and the
response to this stimulus is determined. The measure of performance for

a class of perceptrons (characterized by the parameters $N_a$, $\theta$, $x$, and
$y$ for a binomial model, or by $N_s/N_a$, $\theta$, $x$, and $y$ for a Poisson model)
is the probability that a perceptron from the specified class will give the
correct response to $S_x$ after having been "trained" with the specified
sequence of stimuli.


7.1.1 Notation and Symbols


$S_j$ = the $j$th stimulus in the environment

$\rho_j$ = +1 if $S_j$ is in the positive class
         -1 if $S_j$ is in the negative class

$a_i^*(j, k, \ldots, x)$ = 1 if the A-unit $a_i$ is active for $S_j, S_k, \ldots,$ and $S_x$
                          0 otherwise

$Q_{jk \ldots x}$ = $E\,a_i^*(j, k, \ldots, x)$ = probability that $a_i^*(j, k, \ldots, x) = 1$
(as defined in Chapter 6)

$T$ = duration (number of stimuli) of the training sequence

$v_{ir}$ = value of the connection from the $i$th A-unit after the
training sequence

$c_{ir}^*(x) = c_{ir}^*(x, T) = a_i^*(x)\, v_{ir}(T)$ = signal received by the
R-unit on connection $c_{ir}$ when test stimulus $S_x$ is shown after the
training sequence. The time $T$ will be understood unless otherwise specified.

$u_x = \alpha_r(T) = \sum_i c_{ir}^*(x)$ = total input to the response unit when $S_x$ is shown
after the training sequence. For present purposes, the symbol $u_x$ will be
used, as in Chapter 5. Time $T$ is understood unless otherwise specified.


In terms of these symbols, the reinforcement rule for a quantized
α-system, with S-controlled reinforcement, can be represented by the
following expression for the change in $v_{ir}$ when stimulus $S_j$ is shown:

$$\Delta v_{ir} = \eta\, a_i^*(j)$$


7.1.2 Fixed Sequence Experiments: Analysis 


The first case to be considered is that of a fixed training sequence, 
in which a definite sequence of stimuli ( Sp» Sys-e+es Sz ) is shown to the 
perceptron. In a later section, random training sequences will be considered. 
The fixed sequence consists of a fixed (though not necessarily equal) number 
of showings of each stimulus. For α-perceptrons, the order of occurrence
of these stimuli does not affect the results. All values $v_{ir}$ are assumed to
be zero initially. The following analysis and theorem follow the treatment
of Joseph (Ref. 41). 
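The order-independence claimed here follows because, under S-controlled reinforcement, each showing of $S_j$ simply adds $\rho_j a_i^*(j)$ to the value of unit $a_i$ (taking $\eta$ to have magnitude 1). A minimal sketch, in which the activity table `active[i][j]` is made-up data standing in for $a_i^*(j)$ and the function names are hypothetical:

```python
import random


def train_s_controlled(active, rho, sequence):
    """Apply delta v_i = rho_j * a_i(j) for every stimulus j in the sequence.

    active[i][j] is 1 if A-unit i responds to stimulus j, else 0 (illustrative
    data); rho[j] is +1 or -1; all values start at zero.
    """
    values = [0] * len(active)
    for j in sequence:
        for i, row in enumerate(active):
            values[i] += rho[j] * row[j]
    return values


def response_input(active, values, x):
    """u_x: the sum of the values of the A-units active for test stimulus x."""
    return sum(v for row, v in zip(active, values) if row[x])


rng = random.Random(0)
n_units, n_stimuli = 50, 6
active = [[rng.randint(0, 1) for _ in range(n_stimuli)] for _ in range(n_units)]
rho = [+1, +1, +1, -1, -1, -1]

sequence = [0, 1, 2, 3, 4, 5, 0, 3]   # fixed sequence with unequal showings
shuffled = sequence[:]
rng.shuffle(shuffled)

v_a = train_s_controlled(active, rho, sequence)
v_b = train_s_controlled(active, rho, shuffled)
print(v_a == v_b)   # same showings in a different order give the same values
```

Because the update is a plain sum, only the number of showings of each stimulus matters, which is why the analysis can work with the proportions of the sequence rather than with the sequence itself.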


If a given perceptron is shown a training sequence, it will place
a test stimulus $S_x$ in the positive class if $u_x$ is greater than zero, and in
the negative class if $u_x$ is less than zero. For the given perceptron,
training sequence, and test stimulus, $u_x$ is a determinate number.
Over the class of perceptrons, however, $u_x$ is a random variable.
In order to determine the probability that a perceptron from the specified
class will classify $S_x$ correctly, we must know the probability that $u_x$
has the correct sign. In order to obtain a conservative bound on the
probability of correct response to $S_x$, without making any assumptions
about the distribution of $u_x$, Joseph makes use of the Tchebysheff
inequality, which states that for any random variable $z$ with mean $\mu$
and variance $\sigma^2$,

$$\mathrm{Prob}\{z > 0\} \ge 1 - \frac{\sigma^2}{\mu^2} \quad \text{if } \mu > 0$$

$$\mathrm{Prob}\{z < 0\} \ge 1 - \frac{\sigma^2}{\mu^2} \quad \text{if } \mu < 0$$

Consequently, if the ratio $\mu(u_x)/\sigma(u_x)$ can be made arbitrarily large,
the probability that $u_x$ for a randomly selected perceptron will agree in
sign with its expected value over the class of perceptrons can be made
arbitrarily close to one.* It thus becomes important, first of all, to know
whether or not the expected value of $u_x$ has the proper sign.


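* Joseph has pointed out that if the one-sided inequality
$P_r\{z - \mu \ge t\} \le \sigma^2/(\sigma^2 + t^2)$
is used in place of the two-sided inequality $P_r\{|z - \mu| \ge t\} \le \sigma^2/t^2$,
slightly sharper bounds may be achieved, i.e.,

$$P_r\{z > 0\} \ge 1 - \frac{\sigma^2}{\sigma^2 + \mu^2} \quad \text{if } \mu > 0$$

$$P_r\{z < 0\} \ge 1 - \frac{\sigma^2}{\sigma^2 + \mu^2} \quad \text{if } \mu < 0$$

In the range of interest, this additional sharpness is insignificant.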
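To see how the two-sided and one-sided bounds compare numerically, the short sketch below evaluates $1 - \sigma^2/\mu^2$ and $1 - \sigma^2/(\sigma^2 + \mu^2)$ for a few values of $\mu/\sigma$. The function names are illustrative only, and the two-sided bound is clipped at zero where it would otherwise be negative.

```python
def two_sided_bound(mu, sigma):
    """Tchebysheff lower bound on Prob{z has the sign of mu}: 1 - sigma^2/mu^2."""
    return max(0.0, 1.0 - (sigma / mu) ** 2)


def one_sided_bound(mu, sigma):
    """One-sided lower bound: 1 - sigma^2 / (sigma^2 + mu^2)."""
    return 1.0 - sigma ** 2 / (sigma ** 2 + mu ** 2)


for ratio in (1.0, 2.0, 3.0, 5.0, 10.0):
    print(ratio, two_sided_bound(ratio, 1.0), one_sided_bound(ratio, 1.0))
```

For $\mu/\sigma = 3$, for example, the bounds are 8/9 and 9/10; the gap shrinks rapidly as the ratio grows, in line with the remark that the additional sharpness is insignificant in the range of interest.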
DEFINITION: $S_x$ will be called a positive stimulus (with respect to a
class of perceptrons, an environment, classification, and training sequence)
if the expected value of $u_x$ agrees in sign with the assigned class of $S_x$.
In terms of the symbols introduced above, $S_x$ is a positive stimulus if

$$\rho_x E(u_x) > 0$$

The expected value of $u_x$ for an α-perceptron (assuming
that all A-R unit connections start out with zero value) is obtained as
follows. Let $P_j$ = the number of times stimulus $S_j$ occurs in the
training sequence, divided by $T$, the total number of stimuli in the
sequence (i.e., the proportion of the training sequence which is $S_j$).
Then the value of the connection from unit $a_i$ at the end of the training
sequence will be (since the magnitude of $\eta$ is taken to be 1)

$$v_{ir} = T \sum_j \rho_j P_j a_i^*(j) \qquad (7.1)$$

where the sum is over all stimuli in $W$. Consequently, summing over all
A-units, the input signal to the response unit when the test stimulus $S_x$
occurs will be

$$u_x = T \sum_i \sum_j \rho_j P_j a_i^*(j, x) = \sum_i c_{ir}^*(x) \qquad (7.2)$$

The expected value of $u_x$ is therefore given by

$$E u_x = T \sum_i \sum_j \rho_j P_j E\,a_i^*(j, x) = T N_a \sum_j \rho_j P_j Q_{jx} \qquad (7.3)$$
From the above definition, it follows that $S_x$ is a positive stimulus (and
will tend to be correctly classified) if

$$\sum_j \rho_j \rho_x P_j Q_{jx} > 0$$

From Equation (7.3) it is clear that $E u_x$ increases linearly
with $N_a$. Let us now consider the variance of $u_x$. This is obtained
from the equation:

$$\sigma^2(u_x) = \sum_i \sigma^2[c_{ir}^*(x)] + \sum_i \sum_{i' \ne i} \mathrm{cov}[c_{ir}^*(x), c_{i'r}^*(x)] \qquad (7.4)$$
; 6 i Pe 


For the conditions currently being considered (an α-system with a
predetermined training sequence) the only source of variability in $c_{ir}^*(x)$
is in the selection of the origin point configuration of the unit $a_i$. But if
we assume (as in all models thus far considered) that the A-units are all
chosen independently from a distribution of admissible origin configurations,
the covariances will all be zero, and $\sigma^2(c_{ir}^*(x))$ does not depend on $i$.
Therefore, the general equation (7.4) reduces to

$$\sigma^2(u_x) = N_a \sigma^2(c_{ir}^*(x)) = N_a [E\,c_{ir}^{*2}(x) - E^2 c_{ir}^*(x)] \qquad (7.5)$$

(See Rosenblatt, Ref. 79, pp. 82-83, for a more detailed algebraic
discussion of this equality). Now, for an α-system, Equations (7.1) and
(7.2) give

$$c_{ir}^*(x) = T \sum_j \rho_j P_j a_i^*(j, x)$$

This yields, for the required expected values in (7.5),

$$E\,c_{ir}^{*2}(x) = T^2 \sum_j \sum_k \rho_j \rho_k P_j P_k Q_{jkx}$$

and

$$E^2 c_{ir}^*(x) = T^2 \sum_j \sum_k \rho_j \rho_k P_j P_k Q_{jx} Q_{kx}$$

Substituting in (7.5) and simplifying, this yields

$$\sigma^2(u_x) = N_a T^2 \sum_j \sum_k \rho_j \rho_k P_j P_k (Q_{jkx} - Q_{jx} Q_{kx}) \qquad (7.6)$$


Note that the variance depends on $Q_{jkx}$, while the expected value depends
only on $Q_{jx}$. This variance, like the expected value, is of the order of
$N_a$. We are now in a position to prove the following theorem (due to
Joseph):

THEOREM: Given a class of elementary α-perceptrons, a finite
stimulus world $W$, a classification $C(W)$, and a
training sequence; then for every $\epsilon > 0$, there exists
an $N_0(\epsilon)$ such that if $N_a > N_0(\epsilon)$, the probability
of selecting a perceptron which will correctly identify
the class of every positive stimulus will be greater
than $1 - \epsilon$.

PROOF: From the Tchebysheff inequality, we have seen that if
$\mu(u_x)/\sigma(u_x)$ can be made arbitrarily large, the probability
that $u_x$ will agree in sign with its expected value over
the class of perceptrons will approach unity.
It has also been demonstrated (Equations 7.3 and 7.6) that both $\mu(u_x)$
and $\sigma^2(u_x)$ are of the order of $N_a$; therefore, $\mu^2(u_x)/\sigma^2(u_x)$
will be of the order of $N_a$. Thus, for each positive stimulus, $S_x$,
the probability that $u_x$ agrees in sign with $\mu(u_x)$ can be made arbitrarily
close to 1 by choosing $N_a$ sufficiently large. Suppose there are $n$ stimuli
in $W$. Then, for the $j$th positive stimulus there exists a quantity $N_j(\epsilon)$
such that if $N_a > N_j(\epsilon)$, the probability of selecting a perceptron
which fails to correctly identify $S_j$ will be less than $\epsilon/n$. If we let
$N_0(\epsilon) = \max_j N_j(\epsilon)$, the condition required by the theorem is
satisfied. Q.E.D.


From Equations (7.3) and (7.6), it is seen that for a given set
of stimulus frequencies $P_j$ the ratio $\mu^2/\sigma^2$ does not depend on $T$.
Thus any number of repetitions of the same training sequence can occur
without affecting the performance of the system. Since $\mu^2/\sigma^2$ varies
linearly with $N_a$, the normalized ratio $\mu^2/(N_a \sigma^2)$ forms a convenient
measure for the comparison of different perceptron models. Some numerical
values for typical cases will be considered in the following section.
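Written out from Equations (7.3) and (7.6), the cancellation of $T$ and the linear dependence on $N_a$ can be seen directly. This is only a restatement in the notation of Section 7.1.1, not an additional result:

```latex
\frac{\mu^2(u_x)}{\sigma^2(u_x)}
  = \frac{\bigl(T N_a \sum_j \rho_j P_j Q_{jx}\bigr)^2}
         {N_a T^2 \sum_j \sum_k \rho_j \rho_k P_j P_k \,(Q_{jkx} - Q_{jx} Q_{kx})}
  = N_a \,
    \frac{\bigl(\sum_j \rho_j P_j Q_{jx}\bigr)^2}
         {\sum_j \sum_k \rho_j \rho_k P_j P_k \,(Q_{jkx} - Q_{jx} Q_{kx})}
```

The remaining fraction depends only on the stimulus frequencies and the $Q$-functions of the model, which is what makes the normalized ratio usable for comparing models of different size.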


While the above analysis permits us to obtain a rigorous lower
bound for the probability of correct identification of $S_x$ by a randomly
selected perceptron, it does not actually yield an estimate of this probability.
In order to estimate the probability of correct identification of $S_x$, it will
be assumed that $u_x$ is normally distributed. The justification for this
assumption was discussed in Rosenblatt, Ref. 79, and subsequent analysis
has shown that the approximation is very close, even for perceptrons with a
small number of A-units. Assuming a normal distribution, we have for
the probability of a positive response to $S_x$

$$P = P(r \mid S_x) = \Phi(\mu/\sigma) \qquad (7.7)$$

where $\Phi$ denotes the cumulative normal distribution function, and $\mu$ and
$\sigma$ are the mean and standard deviation of $u_x$.


Note that the above equations do not depend on whether the
perceptron is constructed according to the binomial model, Poisson model,
or any other model, so long as the A-units are selected independently
of one another. The performance does depend on the $Q$-functions, however,
which will be different for different models. From equation 7.3 it is clear
that any stimulus $S_x$ will tend to be classified correctly if the average value
of $Q_{jx}$ for $S_j$ in the same class as $S_x$ is greater than the average value
of $Q_{jx}$ for $S_j$ in the opposite class from $S_x$. (If the frequencies $P_j$
are not all equal, each $Q_{jx}$ must be multiplied by its appropriate frequency
in obtaining these averages.) From the analysis of $Q$-functions in the
preceding chapter, it is clear that this condition will generally be met if
the stimuli of each class have large intersections with one another (on
the retina) while stimuli from opposite classes have small intersections
with one another. The ideal situation would consist of two disjoint clusters
of stimuli, located in different parts of the retinal field, each cluster
representing one class. In order to discriminate two stimuli reliably
(i.e., to assign them to opposite classes) it is desirable that $Q_{ij}$ for
the two stimuli should be small, and particularly that the conditional
probabilities $Q_{i|j}$ and $Q_{j|i}$ should be as small as possible. Figure 10,
in the last chapter, shows that this condition can readily be met if the
stimuli have a small intersection with one another, but becomes increasingly
difficult to meet as the intersection increases. This figure also shows that
a binomial model is better suited to the discrimination of similar stimuli
than a Poisson model, where $Q_{i|j}$ is apt to be relatively large even
for disjoint stimuli.


7.1.3 Fixed Sequence Experiments: Examples 


The environment which was considered in the last section of
Chapter 6, involving twenty horizontal bars and twenty vertical bars on a
20 by 20 toroidally connected retina, is a convenient one to use for a
"calibration experiment", by which different classes of perceptrons can
be compared. In particular, consider the following discrimination
experiment:


EXPERIMENT 1: Given a perceptron with 400 sensory points arranged in
a 20 by 20 toroidally connected array, or "retina", let $W$ consist of the
twenty possible 4 by 20 horizontal bars, and the twenty possible 4 by 20
vertical bars. Let $C(W)$ be a classification which assigns every
horizontal bar to the positive class, and every vertical bar to the negative
class. Show every bar in $W$ to the perceptron exactly once (or in a
sequence with $P_j$ equal for all stimuli). During this training sequence,
the perceptron is reinforced with S-controlled reinforcement. Then
select one of the bars, $S_x$, and determine whether the response is
correct, according to $C(W)$.
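For Experiment 1 the ratio $\mu/\sigma$ can also be estimated by sampling, without evaluating the $Q$-functions analytically. The sketch below draws single A-units at random, computes $c = a^*(x) \sum_j \rho_j P_j a^*(j)$ for each (taking $T = 1$ and $P_j = 1/40$), and uses $\mu = N_a E(c)$ and $\sigma^2 = N_a \sigma^2(c)$, which hold when the units are chosen independently. The wiring rule (origins drawn uniformly and independently over the retina) and all names are assumptions made for illustration, so the result need not reproduce published values.

```python
import math
import random

SIZE, BAR = 20, 4


def make_bars():
    """The 40 stimuli of Experiment 1 and their classes (+1 horizontal, -1 vertical)."""
    horizontal = [{((t + d) % SIZE, c) for d in range(BAR) for c in range(SIZE)}
                  for t in range(SIZE)]
    vertical = [{(r, (t + d) % SIZE) for d in range(BAR) for r in range(SIZE)}
                for t in range(SIZE)]
    return horizontal + vertical, [+1] * SIZE + [-1] * SIZE


def is_active(exc, inh, stimulus, theta):
    """A-unit responds if excitatory minus inhibitory input reaches theta."""
    return (sum(p in stimulus for p in exc)
            - sum(p in stimulus for p in inh)) >= theta


def performance_ratio(n_a=100, x=3, y=1, theta=2, test_index=0,
                      samples=20000, seed=0):
    """Estimate rho_x * mu / sigma for u_x by sampling independent A-units."""
    rng = random.Random(seed)
    stimuli, rho = make_bars()
    p = 1.0 / len(stimuli)              # equal frequencies P_j
    total = total_sq = 0.0
    for _ in range(samples):
        exc = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in range(x)]
        inh = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in range(y)]
        c = 0.0
        if is_active(exc, inh, stimuli[test_index], theta):
            # contribution of this unit to u_x after one pass through W
            c = sum(r * p for s, r in zip(stimuli, rho)
                    if is_active(exc, inh, s, theta))
        total += c
        total_sq += c * c
    mean = total / samples
    variance = total_sq / samples - mean * mean
    return rho[test_index] * math.sqrt(n_a) * mean / math.sqrt(variance)
```

A positive return value indicates that the chosen test bar is a positive stimulus in the sense of the definition in Section 7.1.2; the value scales as $\sqrt{N_a}$, as the analysis requires.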


Figure 13 PROBABILITY OF CORRECT IDENTIFICATION OF A TEST STIMULUS BY AN
ELEMENTARY α-PERCEPTRON, IN EXPERIMENT 1 (CURVES ALSO APPLY TO
γ-PERCEPTRONS; SEE CHAPT. 8). HORIZONTAL AXIS: NUMBER OF ASSOCIATION UNITS ($N_a$).


Figure 14 PROBABILITY OF CORRECT IDENTIFICATION OF A TEST STIMULUS BY AN
ELEMENTARY α-PERCEPTRON, IN EXPERIMENT 2 (FOR TWO BINOMIAL MODELS).
CURVES ALSO APPLY TO γ-PERCEPTRONS (SEE CHAPT. 8). HORIZONTAL AXIS:
NUMBER OF ASSOCIATION UNITS ($N_a$).
Table 3 shows the performance ratios, $\mu/\sigma$, for a 100
A-unit binomial model α-perceptron, with various combinations of the
parameters $x$ and $y$ ($\theta = 2$ in all cases). The parameters $x = 3$,
$y = 1$, $\theta = 2$ appear to be optimum for this experiment, as can be
seen from the table. (Increasing the threshold results in a definite drop
in performance.) Figure 13 shows the performance of several binomial
and Poisson model perceptrons as a function of $N_a$, computed from
Equation (7.7). The top curve shows the performance of the optimum
(binomial) system. A comparison of the other two curves illustrates the
relatively poor performance of the Poisson model on this particular problem.
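Because $\mu^2/\sigma^2$ is proportional to $N_a$, a performance ratio tabulated for 100 A-units can be rescaled to any other size and passed through Equation (7.7). The sketch below does this for the ratio 2.912 reported in Table 3 for $x = 3$, $y = 1$, $\theta = 2$; the function names are illustrative, and the calculation rests on the normality assumption made in Section 7.1.2.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_correct(ratio_100, n_a):
    """P(correct) from Eq. (7.7), scaling mu/sigma as sqrt(n_a / 100)."""
    return normal_cdf(ratio_100 * math.sqrt(n_a / 100.0))


for n_a in (10, 25, 50, 100, 200):
    print(n_a, prob_correct(2.912, n_a))
```

The same function applied to a Table 4 entry gives the corresponding point on the Experiment 2 curves.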


It should be emphasized that the parameters found to be optimum
in this experiment will not necessarily turn out to be optimum in other
environments, or other classifications. In general, it appears that as the
classes of patterns to be discriminated become more "similar" (i.e., as
the maximum possible overlap between stimuli from opposite classes
increases) the optimum number of connections to an A-unit and the optimum
value of $\theta$ tend to increase.


A more difficult classification of the same dichotomy has been 


studied in the following experiment: 


EXPERIMENT 2: With the same environment as in Experiment 1, number 
the horizontal and vertical bars consecutively according to their position on 
the retina. Let the classification C(W) place all even numbered bars in 
the positive class, and all odd numbered bars in the negative class. The 


training and testing procedures are identical to Experiment 1. 
TABLE 3


PERFORMANCE RATIOS ($\mu/\sigma$) FOR 100-A-UNIT ELEMENTARY α-PERCEPTRONS
(BINOMIAL MODEL) FOR EXPERIMENT 1 (HORIZONTAL/VERTICAL BAR DISCRIMINATION,
FIXED SEQUENCE). $\theta$ = 2 IN ALL CASES.


x (NUMBER OF EXCITATORY CONNECTIONS PER A-UNIT)
y (NUMBER OF INHIBITORY CONNECTIONS PER A-UNIT)

 y \ x      2        3        4        6
   0      2.878    2.831    1.680     .931
   1      2.063    2.912    2.10?    1.309
   2      1.708    2.808    2.479    1.773
   3      1.806    2.592    2.670    2.140
   4      1.153    2.329    2.708    2.418
   5        ?      2.006    2.680    2.879
   6       .767    1.777    2.473    2.638
   7       .623    1.523    2.271    2.608

TABLE 4


PERFORMANCE RATIOS FOR 100-A-UNIT ELEMENTARY α-PERCEPTRONS
(BINOMIAL MODEL) FOR EXPERIMENT 2. $\theta$ = 2 IN ALL CASES.


x (NUMBER OF EXCITATORY CONNECTIONS)
y (NUMBER OF INHIBITORY CONNECTIONS)

 y \ x      2        3        4        5
   0       .358     .426     .328     .274
   1       .368     .602     .436     .363
   2       .362     .551     .526     .45?
   3       .350     .578     .896     .533
   4       .333     .685     .686     .605
   5       .310     .578     .677     .668
   6       .285     .558     .690     .707
   7       .268     .629     .688     .736
In this case, the two most similar bars to any test bar (those
which overlap it by 3/4 of its area on either side) are invariably in the
opposite class. Nonetheless, all stimuli may be positive stimuli under
these conditions, with a suitable choice of parameters. Table 4 shows the
ratio $\mu/\sigma$ for a 100 unit system in this experiment. Figure 14 shows the
performance of a perceptron with the same parameters as before ($x = 3$, $y = 1$,
$\theta = 2$) on this experiment, and also with the best parameters found to date
($x = 5$, $y = 7$, $\theta = 2$). These parameters are the best set for $x \le 5$ and $y \le 7$,
but are probably not optimum, as it seems likely that a further increase in
both $x$ and $y$ would yield a further improvement in performance.


7.1.4 Random Sequence Experiments: Analysis 


For the analysis of the performance of perceptrons trained
with random stimulus sequences, it is convenient to make use of an
unnormalized G-matrix (see footnote, page 75), where $\eta = 1$ instead of
$1/N_a$. For such a matrix, in the α-system, $g_{ij}$ = the number of
units active for both $S_i$ and $S_j$, or

$$g_{ij} = \sum_k a_k^*(i, j) \qquad (7.8)$$

The mathematical properties of the unnormalized G-matrix are no different
from those discovered for the normalized matrix, in Chapter 5.


In a random sequence experiment, the training sequence is
assumed to consist of a series of $T$ stimuli, in which each stimulus in
the series is selected independently of the others. The probability of
selecting stimulus $S_j$ for the $t$th position in the sequence is $p_j$,
for all $t$. We will let $m_j$ = the number of times stimulus $S_j$ occurs
in the training sequence. The random vector $m = (m_1, m_2, \ldots, m_n)$ will have
a multinomial distribution with $T$ trials and probability vector
$p = (p_1, p_2, \ldots, p_n)$. The training sequence selected is assumed to be
independent of the particular perceptron selected for a given experiment.
At the end of the training sequence, the input to the R-unit in response to
a test stimulus $S_x$ will be

$$u_x = \sum_j \rho_j m_j g_{xj} = \sum_j \sum_i \rho_j m_j a_i^*(j, x)$$

Therefore, the expected value over perceptrons and training sequences is

$$E(u_x) = T N_a \sum_j \rho_j p_j Q_{jx} \qquad (7.9)$$

which is of the order of $T N_a$. Note that this is identical to equation (7.3).


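The variance over both perceptrons and training sequences is
given by

$$\sigma^2(u_x) = \sum_j \sigma^2(m_j g_{xj}) + \sum_j \sum_{k \ne j} \rho_j \rho_k \, \mathrm{cov}(m_j g_{xj}, m_k g_{xk})$$

$$= \sum_j [E(m_j^2) E(g_{xj}^2) - E^2(m_j) E^2(g_{xj})]$$

$$+ \sum_j \sum_{k \ne j} \rho_j \rho_k [E(m_j m_k) E(g_{xj} g_{xk}) - E(m_j) E(m_k) E(g_{xj}) E(g_{xk})] \qquad (7.10)$$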
For the components of the multinomially distributed vector $m$ we have

$$E(m_j) = T p_j$$

$$E(m_j^2) = T(T-1) p_j^2 + T p_j$$

$$E(m_j m_k) = T(T-1) p_j p_k$$

Let $n_{ij \ldots k}$ = number of A-units active for stimuli $S_i, S_j, \ldots, S_k$.
The symbol ~ over a subscript will be used to denote negation (e.g.,
$n_{j\tilde{k}}$ = the number of A-units active for stimulus $S_j$ but not for $S_k$;
$n_{j\tilde{k}} = n_j - n_{jk}$). From equation 7.8, it is clear that for the α-system,
$n_{xj} = g_{xj}$. Now, any set of $n$'s which is exhaustive (every A-unit counted
in at least one $n_{ij \ldots k}$), and such that each A-unit is counted in no more
than one $n_{ij \ldots k}$, will have a multinomial distribution. From this it
follows that

$$E(g_{xj}) = N_a Q_{jx}$$

$$E(g_{xj}^2) = N_a (N_a - 1) Q_{jx}^2 + N_a Q_{jx}$$

$$E(g_{xj} g_{xk}) = E[(n_{jkx} + n_{j\tilde{k}x})(n_{jkx} + n_{\tilde{j}kx})]$$

$$= E(n_{jkx}^2) + E(n_{jkx} n_{\tilde{j}kx}) + E(n_{j\tilde{k}x} n_{jkx}) + E(n_{j\tilde{k}x} n_{\tilde{j}kx})$$

$$= N_a Q_{jkx} + N_a(N_a - 1) [Q_{jkx}^2 + Q_{jkx}(Q_{kx} - Q_{jkx}) + (Q_{jx} - Q_{jkx}) Q_{jkx} + (Q_{jx} - Q_{jkx})(Q_{kx} - Q_{jkx})]$$

$$= N_a Q_{jkx} + N_a (N_a - 1) Q_{jx} Q_{kx}$$
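Substituting in (7.10), this yields

$$\sigma^2(u_x) = T N_a \sum_j p_j Q_{jx} [1 + (T-1) p_j + (N_a - 1) Q_{jx} - (T + N_a - 1) p_j Q_{jx}]$$

$$+ \; T N_a \sum_j \sum_{k \ne j} \rho_j \rho_k p_j p_k [(T-1) Q_{jkx} - (T + N_a - 1) Q_{jx} Q_{kx}] \qquad (7.11)$$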


The variance of $u_x$ is therefore on the order of $T N_a^2 + T^2 N_a$, at
maximum. Since the square of the mean is on the order of $T^2 N_a^2$, the
ratio $\mu^2/\sigma^2$ becomes indefinitely large as $N_a$ and $T$ both increase,
and the Theorem stated in Section 7.1.2 is seen to hold for random training
sequences of sufficient length, as well as fixed sequences. As the length of
the training sequence, $T$, increases, the relative frequencies $m_j/T$ will
approach the probabilities $p_j$, and the performance of the system will
approach the performance in a fixed sequence experiment. As $N_a$ goes to
infinity, the ratio $\mu^2/\sigma^2$ approaches

$$\frac{T \bigl[\sum_j \rho_j p_j Q_{jx}\bigr]^2}{\sum_j p_j Q_{jx}^2 - \bigl[\sum_j \rho_j p_j Q_{jx}\bigr]^2}$$


7.1.5 Random Sequence Experiments: Examples 


As a "calibration experiment" for comparing different
systems, the horizontal vs. vertical bar discrimination problem is
particularly convenient. The random sequence version of the experiment is as
follows:


EXPERIMENT 3: For the same conditions and classification as Experiment 1,
show the perceptron a random sequence of horizontal and vertical
bars, each bar occurring with equal frequency ($p_j = 1/40$ for all bars).
During this training sequence, S-controlled reinforcement is used, and the
performance of the perceptron for an arbitrary bar, $S_x$, is then
determined as before.


Figure 15 shows the performance of binomial model α-perceptrons of
three different sizes on this problem, as a function of the length of the
training sequence ($T$). The parameters $x$, $y$, and $\theta$ are the optimum
values (3, 1, 2) found in Section 7.1.3. Further increases in $N_a$ will not
appreciably improve performance in this experiment.


The effect of a "frequency bias" on α-system perceptrons
is illustrated in the following experiment:


EXPERIMENT 4: The conditions and classifications are the same as in
Experiment 3, but the horizontal bars occur four times as frequently as
the vertical bars; i.e., $p_j = .04$ for horizontal bars and $.01$ for vertical
bars.


Figure 16 shows the performance of a 100 A-unit system on this experiment.
The upper curve shows the probability of correctly identifying a horizontal
bar, and the lower curve shows the probability of correctly identifying a
vertical bar. The correct response to vertical bars is actually suppressed
as training increases, due to the greater frequency of horizontal bars. The
Figure 15 PROBABILITY OF CORRECT IDENTIFICATION OF TEST STIMULUS BY BINOMIAL
α-PERCEPTRONS IN EXPT. 3 (RANDOM SEQUENCES) ($x$ = 3, $y$ = 1, $\theta$ = 2).
HORIZONTAL AXIS: NO. OF TRAINING STIMULI ($T$).


Figure 16 PROBABILITY OF CORRECT IDENTIFICATION OF TEST STIMULI IN EXPT. 4.
BINOMIAL α-PERCEPTRON WITH $N_a$ = 100, $x$ = 3, $y$ = 1, $\theta$ = 2.
$p_j$ = .04 FOR HORIZONTAL BARS; .01 FOR VERTICAL BARS.
HORIZONTAL AXIS: NO. OF TRAINING STIMULI ($T$).
broken curve shows the mean performance on both classes, with test
stimuli drawn from each class with their appropriate frequencies. In the
following chapter, it will be seen that this performance can be considerably
improved in a γ-system perceptron. It would also be improved for an
α-perceptron if error correction training were employed instead of
S-controlled reinforcement.
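The frequency-bias effect of Experiment 4 can be reproduced qualitatively with a small simulation of random-sequence training. The sketch below draws a random sequence of length $T$, forms $v_i = \sum_j \rho_j m_j a_i^*(j)$ from the stimulus counts $m_j$, and then tests every bar. It assumes the same illustrative wiring rule as before (origins drawn uniformly and independently), and all function names are hypothetical.

```python
import random

SIZE, BAR = 20, 4


def make_bars():
    """The 40 bars and their classes (+1 horizontal, -1 vertical)."""
    horizontal = [{((t + d) % SIZE, c) for d in range(BAR) for c in range(SIZE)}
                  for t in range(SIZE)]
    vertical = [{(r, (t + d) % SIZE) for d in range(BAR) for r in range(SIZE)}
                for t in range(SIZE)]
    return horizontal + vertical, [+1] * SIZE + [-1] * SIZE


def activity(rng, stimuli, n_a, x, y, theta):
    """active[i][j] = 1 if randomly wired A-unit i responds to stimulus j."""
    table = []
    for _ in range(n_a):
        exc = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in range(x)]
        inh = [(rng.randrange(SIZE), rng.randrange(SIZE)) for _ in range(y)]
        table.append([int(sum(p in s for p in exc)
                          - sum(p in s for p in inh) >= theta)
                      for s in stimuli])
    return table


def experiment(T, p_horizontal, p_vertical, n_a=100, x=3, y=1, theta=2,
               n_perceptrons=20, seed=0):
    """Return (fraction of horizontal bars correct, fraction of vertical bars correct)."""
    rng = random.Random(seed)
    stimuli, rho = make_bars()
    weights = [p_horizontal] * SIZE + [p_vertical] * SIZE
    correct = [0, 0]
    for _ in range(n_perceptrons):
        active = activity(rng, stimuli, n_a, x, y, theta)
        # m[j] = number of showings of stimulus j in a random sequence of length T
        m = [0] * len(stimuli)
        for j in rng.choices(range(len(stimuli)), weights=weights, k=T):
            m[j] += 1
        # S-controlled reinforcement from zero: v_i = sum_j rho_j m_j a_i(j)
        v = [sum(r * mj * a for r, mj, a in zip(rho, m, row)) for row in active]
        for test in range(len(stimuli)):
            u = sum(vi for vi, row in zip(v, active) if row[test])
            if u * rho[test] > 0:
                correct[0 if rho[test] > 0 else 1] += 1
    total = SIZE * n_perceptrons
    return correct[0] / total, correct[1] / total
```

With `experiment(400, 0.04, 0.01)` the horizontal bars should be identified far more reliably than the vertical ones, in line with the suppression described above; with equal probabilities of 0.025 the setup corresponds to Experiment 3.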


7.2 Discrimination Experiments with Error Correction Procedures 


The analysis and experiments in the preceding section deal with 
S-controlled reinforcement experiments. In Chapter 5, Theorem 6, it was 
shown that this procedure cannot be guaranteed to yield a solution to a 
classification problem, even though a solution may exist, whereas an error 
correction procedure will always yield a solution if any solutions exist. The 
error correction procedure would therefore seem to be the method of choice 
in training a perceptron to discriminate between two classes of stimuli. 
Unfortunately, the type of analysis which was carried out for S-controlled 
experiments is not readily performed with error-correction experiments. 
Consequently, all data on learning curves for error correction procedures
come from one of two sources: simulation on a digital computer*, and
performance of actual experiments on the Mark I perceptron at the Cornell
Aeronautical Laboratory (Refs. 29, 30, 31).


* Experiments performed by Carl Kesler on the Burroughs 220 computer
at Cornell University.
Two main sets of experiments will be described here, the first
with binomial model α-perceptrons, and the second with perceptrons
having additional constraints imposed on their S to A-unit connections.

7.2.1 Experiments with Binomial Models 


The following four experiments have been performed with 
binomial model perceptrons (having fixed numbers of sensory connections 


to each A-unit, with origins located at random in the sensory mosaic): 


EXPERIMENT 5: The environment of horizontal and vertical bars used 

in Experiment 1 is employed, and the stimuli occur in fixed sequence, first 
showing all horizontal bars in fixed sequence, then all vertical bars, and 
repeating the sequence until perfect performance is achieved. The error 
correction procedure is employed, and the performance is tested at the 


end of each sequence. 


EXPERIMENT 6: The same environment and training procedure are
employed as above, but the stimuli occur in a random sequence, with
$p_j = 1/40$ for each stimulus (as in Experiment 3).


EXPERIMENT 7: The environment consists of a set of triangles in all
possible positions on a toroidally connected 20 by 20 retina, and a set of
squares in all possible positions on the retina. The triangles and squares
each cover 80 of the 400 retinal points. The sequence is random, as in
Experiment 6, with $p_j = 1/800$ for each stimulus. (The set of possible
stimuli is generated by translations of a standard image; rotations are not
permitted.)
Figure 17 PERFORMANCE OF BINOMIAL α-PERCEPTRONS IN EXPERIMENTS 5 AND 6
(HORIZONTAL / VERTICAL BAR DISCRIMINATION WITH ERROR CORRECTION
PROCEDURE). SOLID CURVES SHOW MEAN PERFORMANCE OF 25 PERCEPTRONS,
WITH $N_a$ = 300, $x$ = 3, $y$ = 1, $\theta$ = 2


Figure 18 PERFORMANCE OF BINOMIAL α-PERCEPTRONS IN SQUARE / TRIANGLE DISCRIMINATION
(EXPT. 7) COMPARED WITH HORIZONTAL / VERTICAL BAR DISCRIMINATION (EXPT. 6)


EXPERIMENT 8: The horizontal/vertical bar environment is employed, as 
in Experiment 6, with stimuli occurring in random sequence. A random 
sign correction procedure is employed for training the perceptron (see 


Definition, Section 5.6). 
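
The stimulus environment of Experiment 7 can be sketched in a few lines
of Python. This is an illustration only: the function name `translations`
and the two small standard images are assumptions made for the example
(the figures used in the experiment cover 80 retinal points each, and
their outlines are not reproduced here). The sketch shows why the
environment contains 800 stimuli, and hence why p = 1/800.

```python
def translations(shape, size=20):
    """All distinct translations of a set of (row, col) points on a torus."""
    return {
        frozenset(((r + dr) % size, (c + dc) % size) for r, c in shape)
        for dr in range(size)
        for dc in range(size)
    }

# Hypothetical standard images, much smaller than the 80-point figures.
square = {(r, c) for r in range(5) for c in range(5)}
triangle = {(r, c) for r in range(5) for c in range(r + 1)}

squares = translations(square)
triangles = translations(triangle)
environment = squares | triangles

# Each stimulus is equally likely in the random training sequence.
p = 1 / len(environment)
```

Because the retina is toroidally connected, a figure that runs off one
edge re-enters at the opposite edge, so each standard image yields one
stimulus for every one of the 400 retinal positions.
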


Figure 17 shows the results of Experiments 5 and 6, and includes
a theoretical learning curve for an S-controlled experiment for comparison.
The experimental curves show the mean performance for a set of 25 binomial
perceptrons with 300 A-units, and the optimum parameters (x = 3, y = 1,
θ = 2) found in the preceding section. The same 25 perceptrons were
employed in Experiments 5 and 6. It appears to be characteristic that a
random training sequence leads to a more rapid learning rate initially, but
is overtaken by the fixed sequence performance as the duration of training
increases. Note that in both cases, the error correction method yields
considerably better performance than the S-controlled method.


Figure 18 shows the mean performance of a set of 15 perceptrons
on Experiment 7. The parameters are N_a = 300, x = 6, y=#, θ = 3.
These were the best parameters tested, but are probably not
optimum. The learning curve for the horizontal/vertical bar experiment
(Experiment 6) is shown as a broken line for comparison. The slow learning
rate in this experiment is largely due to the large number of distinct stimuli
in the environment (800) compared to the number in the horizontal/vertical
bar environment (40). The increased number of stimuli means that a much
longer training sequence is required to guarantee a representative sample
of all stimuli, with a reasonably uniform coverage of the retinal field. A
further difficulty is introduced by the fact that the maximum overlap of a
square and triangle is much greater than the maximum overlap of a horizontal
and vertical bar, making the discrimination intrinsically more difficult.


Figure 19 shows a comparison of the performance of 10
perceptrons on Experiment 8 with the performance of the same 10 perceptrons 
on Experiment 6. In Experiment 8, the learning is not only much slower, but 
the variability between perceptrons is greatly increased. Of the ten per- 
ceptrons tested, two achieved perfect performance during the period of the 
experiment, which was discontinued after 2000 training stimuli. Nonetheless, 
each of the ten perceptrons would ultimately achieve perfect performance if 
the experiment were continued (due to Theorem 5, Section 5.6). With the 
directed error correction procedure, all ten perceptrons achieved perfect 


performance within 300 training stimuli. 


While the performance of an elementary perceptron with the
random sign procedure is clearly unsatisfactory for practical systems, it
should be noted that the existence of a consistent bias in the proper direction
still makes this a plausible component of a more reliable mechanism. If a
"majority mechanism" is employed (e.g., a threshold device which responds
to the difference of positive and negative signals from R-units)
to determine the "majority vote" of n such elementary perceptrons,
connected independently to the same retina, a highly reliable system would
result. The error probability of this system would be:

\[ P_e = \sum_{i=0}^{[n/2]} \binom{n}{i} p^i (1-p)^{n-i} \]

where p is the probability of correct response for a single perceptron
(as shown in Figure 19).
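
The effect of the majority mechanism can be checked numerically. The
sketch below sums the binomial probabilities that no more than half of
n independent perceptrons respond correctly. The function name
`majority_error` and the value p = 0.6 are assumptions chosen for
illustration, not figures taken from the experiments.

```python
from math import comb

def majority_error(n: int, p: float) -> float:
    """Probability that at most floor(n/2) of n independent units are correct."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n // 2 + 1)
    )

# Hypothetical single-perceptron reliability; n is kept odd to avoid ties.
p = 0.6
errors = {n: majority_error(n, p) for n in (1, 3, 11, 101)}
```

With any p greater than one half the error falls as n grows, which is the
sense in which a weak but consistent bias can serve as a component of a
reliable system.
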


While the actual learning curve for error correction experiments
cannot at present be stated analytically, R. D. Joseph has obtained an upper
bound for the number of corrective reinforcements that must be applied,
where a solution exists. In the proof of Theorem 4, Chapter 5, it was noted
that an upper bound for the number of corrective reinforcements can be
expressed in terms of the quantity α, as follows:

\[ n_{\max} \le \frac{(k + M\sqrt{n})^2}{\alpha M} \tag{7.12} \]

where M = maximum diagonal element of the G-matrix,
α = minimum of the function f(x) = xHx / ||x||² (as defined for
Theorem 4, Chapter 5),
k = ||Hx^0|| (as in Theorem 4, Chapter 5).


For the case which is of primary interest here, the process
starts from the origin, so that k = ||Hx^0|| = 0. In this case, (7.12)
simplifies to

\[ n_{\max} \le \frac{nM}{\alpha} \]

7.2.2 Experiments with Constrained Sensory Connections 


In all perceptrons considered thus far, connections from S-units 
to A-units have had their origins randomly chosen from the set of all sensory 
points, with equal probability. Such models will be called uniform input 
distribution models (u.i.d. models). It has occasionally been proposed that 


the performance of a perceptron might be considerably improved by the 
introduction of special constraints on the admissible origin point connections.
For example, the retinal connections could be made to resemble biological
systems more closely by assigning a "retinal field" to each A-unit, and
limiting its choice of origin points to S-units within this field. A similar 
procedure would be to construct a network of connections by assigning a 
center at random to each A-unit, somewhere on the retina, and selecting 
connections from a circular normal distribution about this center. Such 
systems will be called normal input distribution models (n.i.d. models). 
Further constraints might lead ultimately to specialized A-units, whose 
input configurations are specially designed to make them responsive to 
stimuli of particular shapes, or configuration properties. We will consider 
one further constraint in this section: the case in which the excitatory and 
inhibitory connections to an A-unit are assigned distinct centers on the 
retina, with origins selected from a circular normal distribution about 
these centers. This will be called the divided input distribution (d.i.d.) 
model. The n.i.d. model can be considered a special case of the d.i.d. 
model in which the excitatory and inhibitory centers and dispersions are 


identical. 


In the general d.i.d. model, A-units are characterized by
seven parameters: x, y and θ as before, the expected distance
between excitatory and inhibitory centers (ED), the standard deviation
of this distance (σ_D), and the standard deviations of the normal
probability distributions about the excitatory and inhibitory centers
(σ_x and σ_y). A number of experiments have been performed with such
models in an attempt to discover what sort of improvement might be achieved
by an optimum set of constraints on the sensory connections.
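
The construction of a d.i.d. A-unit can be sketched as follows. This is
one possible reading rather than the simulation program actually used:
the function name `did_origins`, the uniform placement of the excitatory
center, the random direction of the inhibitory center, the rounding to
the nearest retinal point, and the toroidal wrap-around are all
assumptions introduced for the example.

```python
import math
import random

def did_origins(x, y, ed, sd_d, sd_x, sd_y, size=20, rng=random):
    """Sample origin points for one A-unit under a d.i.d.-style scheme.

    Returns (excitatory, inhibitory) lists of (row, col) retinal points.
    """
    # Excitatory center: uniform over the retina (an assumption).
    ex_center = (rng.uniform(0, size), rng.uniform(0, size))
    # Inhibitory center: normally distributed distance, random direction.
    distance = rng.gauss(ed, sd_d)
    angle = rng.uniform(0, 2 * math.pi)
    in_center = (ex_center[0] + distance * math.cos(angle),
                 ex_center[1] + distance * math.sin(angle))

    def draw(center, sd, count):
        # Circular normal scatter about the center, wrapped onto the torus.
        return [(round(rng.gauss(center[0], sd)) % size,
                 round(rng.gauss(center[1], sd)) % size)
                for _ in range(count)]

    return draw(ex_center, sd_x, x), draw(in_center, sd_y, y)
```

Setting `ed` and `sd_d` to zero and giving both dispersions the same
value makes the two centers coincide, which corresponds to the n.i.d.
special case described above.
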


Experiments 6 and 7 have been used for the study of constrained
input distributions. In the square/triangle discrimination experiment
(Experiment 7) the performance of the d.i.d. models never showed any
improvement over the original u.i.d. model. A large number of combinations
of x, y, and θ were tested with various distribution parameters,
in an attempt to find the optimum system for x + y ≤ 10.
The best performance was obtained for a set of 15 perceptrons with x = 6,
y=%, θ = 3, ED = 0, σ_D = 0, σ_x = 7, and σ_y = 7.
This is equivalent to an n.i.d. model with the same centers for excitatory
and inhibitory distributions, and σ = 7. The performance of this system
did not differ from that of the equivalent u.i.d. model by more than 1% at
any point on the learning curve, and was within 1/4% of the u.i.d. performance
at most of the points tested. The same stimulus sequences were used for
both models in order to make conditions as closely comparable as possible.
These results suggest that for large but spatially concentrated stimulus
patterns, little advantage is to be gained in an elementary perceptron by
imposing radial constraints on the origin point configurations.


In the case of the horizontal/vertical bar discrimination 
(Experiment 6) a slight advantage was found for the d.i.d. model for the 
parameters x=/, y=97, 021, ED#=(2, cD=2 , 7x22 , Ty =F. 
On the basis of a number of simulation experiments, this appears to be
close to an optimum configuration for the d.i.d. model for this
experiment. Figure 20 shows the results obtained from 25 runs with these
parameters, compared with 25 u.i.d. models with optimum parameters
(x = 3, y = 1, θ = 2) using the identical training sequences. The
difference, although slight, appears to be statistically significant.


The general conclusion from these experiments seems to be
that (for large stimuli) little is to be gained from special constraints which
affect only the dispersion, rather than the geometric form, of origin point
patterns in elementary perceptrons. A further variation of the model, in
which elliptical rather than circular distributions of origin points are
employed, might be more sensitive to contours and directions of elongation
in the stimuli.* No quantitative results are available on such a model at
this time.


7.3 Discrimination Experiments with R-controlled Reinforcement 


In an experimental system with R-controlled reinforcement
(Definition 39) the reinforcement control system receives information about
the outputs of the perceptron, but receives no information directly from the
environment. Such experiments are of interest in determining the
"spontaneous organization" tendencies of perceptrons. It is readily seen, from
theoretical considerations, that the performance of an elementary
α-perceptron in such experiments is unlikely to be of psychological interest.
In an α-perceptron, all g_{ij} are generally greater than zero, so that
whatever response is associated to the first stimulus in a training sequence
will tend to generalize to all other stimuli in the environment.
Consequently, the perceptron, left to its own devices without any attempt to
change its responses, will tend to form a classification C(W) in which
all stimuli in W are either in the positive class or else all in the negative
class, with equal probability.**


* See Section 23.1.2 for a reconsideration of this problem from the
standpoint of sensory analyzing mechanisms.


** In Ref. 82, such systems have been called "Class C perceptrons".


Two special cases are of interest, in which it is possible for
a dichotomy to be formed with both classes non-empty. In the first case,
some of the g_{ij} coefficients are zero. This might occur in a system
with high thresholds on the A-units, so that some pairs of stimuli activate
no A-units in common. If S_i and S_j are two such stimuli, then if S_i
is the first stimulus and S_j is the second stimulus in the training
sequence, it is perfectly possible that one will become associated to a
positive response, and the other to a negative response. If these are the
only two stimuli, or if there is no positive generalization from any of the
stimuli which become associated to one class to the stimuli of the second
class, this dichotomy may be stable. In general, however, one class is
apt to become dominant, eventually pulling all stimuli into a single class
as before. The second case in which a dichotomy might be formed is that
in which the values are not initially all zero, but are distributed with some
connections negative and some positive. In this case, the generalization
from the first stimulus will not necessarily wipe out an initial bias in the
opposite direction, and it is possible that a dichotomy will be formed.
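
The tendency described above can be illustrated with a small simulation
that works directly on a matrix of generalization coefficients. The
sketch rests on assumptions that are not taken from the text: the
function name `r_controlled_run`, the two example matrices, and in
particular the rule that the reinforcement control system simply
reinforces whatever response the perceptron has just given.

```python
import random

def r_controlled_run(g, steps=200, eta=1.0, rng=random):
    """Train with no outside teacher; return the final response per stimulus.

    g[i][j] is the generalization from stimulus i to stimulus j.
    """
    n = len(g)
    u = [0.0] * n  # signal to the R-unit for each stimulus
    for _ in range(steps):
        i = rng.randrange(n)
        # Response is the sign of the signal; a zero signal is a coin toss.
        if u[i] > 0:
            response = 1
        elif u[i] < 0:
            response = -1
        else:
            response = rng.choice((1, -1))
        # Assumed rule: reinforce whatever response was just given.
        for j in range(n):
            u[j] += eta * response * g[i][j]
    return [1 if value > 0 else -1 for value in u]

# Hypothetical coefficient matrices.
all_positive = [[3, 1, 1], [1, 3, 1], [1, 1, 3]]
with_zeros = [[2, 0], [0, 2]]
```

With `all_positive` the first response spreads to every stimulus, so all
stimuli end in one class. With `with_zeros` the two stimuli do not
interact, and a dichotomy with both classes non-empty can form by chance.
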


While it is possible for dichotomies to be formed in the special
cases mentioned above, there is little reason to suppose that such
dichotomies would ever be of interest to a human observer. If the stimuli are
uniformly distributed on the retina, or uniformly clustered about the
center of the field, the g_{ij} coefficients which happen to be zero will
generally be unrelated to possible "meaningful" classifications of the
stimuli, so that any division into two classes will tend to be random,
and unrelated to any concept of "intrinsic similarity" of the stimuli. Thus
it is clear that in an elementary α-perceptron, psychologically meaningful
discriminations can be achieved only under the control of an experimenter,
or r.c.s., which is capable of evaluating the correctness of the
perceptron's responses according to some predetermined scheme. In the
γ-systems, which are considered in the following chapter, somewhat
more interesting performances in R-controlled experiments are likely to
occur.


7.4 Detection Experiments 


In discrimination experiments, such as those considered in 
the previous sections, the perceptron is required to give one of two responses 
to designate which of two well-defined classes of patterns is present. It is 
assumed that one of the two is always present, and that nothing else is 
present which might confuse the picture. In detection experiments, a 
single pattern, or class of patterns, is taught the perceptron as the "positive
class", and anything else (such as noisy fields, arbitrary patterns, etc.) is
considered to belong to the "negative class". Moreover, the positive pattern
may appear with an admixture of background noise, irrelevant lines, or 
other sensory material. While such detection experiments differ considerably 
in their "psychological" character from discrimination experiments, from a 
theoretical standpoint they represent a special case of discrimination experi- 
ments in which the training and the two classes of stimuli are highly asymme- 
tric, the positive class generally being smaller but more thoroughly trained 
than the negative class. Two cases are of interest: detection in noisy 
environments, and detection in organized environments. These are 


considered separately in the following sections. 


7.4.1 Detection in Noisy Environments


A noisy environment will be defined as the product set of a
set of well-defined stimulus patterns (including an empty field as a stimulus)
and a set of "random noise patterns" superimposed on the members of the
first set. The random noise patterns are generated by applying signals of
random polarity (positive or negative with .5 probability) to a randomly
selected set of S-units, chosen independently with probability P_n. P_n will
be called the noise density of the environment, and represents the expected
value of the proportion of S-points which emit random signals at any given
moment of time.
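
The definition of a noise pattern translates directly into a short
sampling routine. In the sketch below the names `noise_pattern` and
`superimpose` are assumptions, as is the representation of a stimulus as
the set of S-units that are "on"; a negative noise signal is taken to
turn a point off, and a positive one to turn it on.

```python
import random

def noise_pattern(n_units, density, rng=random):
    """Map each noisy S-unit to a random polarity (+1 or -1).

    Each unit is selected independently with probability `density`.
    """
    return {
        i: rng.choice((1, -1))
        for i in range(n_units)
        if rng.random() < density
    }

def superimpose(stimulus, noise):
    """Apply a noise pattern to a stimulus given as a set of 'on' S-units."""
    on = set(stimulus)
    for unit, polarity in noise.items():
        if polarity > 0:
            on.add(unit)      # positive signal turns the point on
        else:
            on.discard(unit)  # negative signal turns the point off
    return on

# Hypothetical 400-point retina with a noise density of 0.1.
noise = noise_pattern(400, 0.1, rng=random.Random(0))
noisy_stimulus = superimpose({1, 2, 3}, {2: -1, 7: 1})
```

An empty stimulus set passed to `superimpose` gives a member of the
negative class: the "empty field" with noise superimposed.
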


Note that a noisy environment is, in its entirety, a well defined
set of stimuli, with a probability p_i associated with each stimulus S_i.
Such an environment consists of two classes: a positive class, in which one
of the "positive stimuli" (e.g., a geometric form) is present in combination
with one of the noise patterns, and a negative class, consisting of the noise
patterns alone, or the "empty field" stimulus with a noise pattern
superimposed. The task of the perceptron is to distinguish between positive and
negative stimuli.


Let S_x represent a test stimulus, selected from the positive
class. Then the probability of correctly identifying S_x as a positive
stimulus in a random sequence experiment, with S-controlled reinforcement,
is given by equation (7.7), with E(u_x) defined by equation (7.9)
and σ²(u_x) defined by equation (7.11), just as in an ordinary
discrimination experiment. Similarly, if S_x is a noise-stimulus, from the
negative class, the probability of obtaining the correct (negative) response
is given by the complement of the probability obtained from equation (7.7).


Some special analytic features of this problem are worth noting. 


For a binomial model, with a large retina and large association
system (so that all Q-functions and retinal intersections of noise patterns
can be assumed equal to their expected value) the intersection of a noise
pattern with any other stimulus will be equal to the expected value of this
intersection.* If we designate the noise patterns by S_n, S_n', ...
and positive stimuli by S_x, S_x', ..., then (as explained on page 146),

\[ Q_{nn'} = Q_n Q_{n'} \quad \text{and} \quad Q_{nx} = Q_n Q_x \]

Let S_x and S_x' represent the same positive stimulus pattern with
different noise patterns superimposed. Then, if the noise density is
low, Q_{xx'} ≈ Q_{xx} = Q_x. But Q_x >> Q_n Q_x. Therefore,
Q_{xx'} >> Q_{nx}, which means that the perceptron can be taught quite
readily to give the proper positive response to a test stimulus, S_x.


* Actually, as noise patterns have been defined, the intersection of a
pure noise pattern with a positive stimulus pattern will be slightly
less than the expected value, since some of the points which normally
are "on" for the positive stimulus will be turned "off" for the noise
pattern. The conclusions above hold rigorously if the noise patterns
are sets of positive signals only.


The same conclusion does not hold for the identification of a negative
(noise) stimulus, however. In this case, the generalization from a previously
trained noise stimulus, S_n', to S_n is equal to Q_{nn'} = Q_n² (assuming
all noise stimuli to be equal in area to their expected value). But the
generalization from a positive stimulus is Q_{nx} = Q_n Q_x, which is generally
greater than Q_n², since the area covered by the positive stimulus with
noise superimposed is generally greater than the area of the noise stimulus
alone. Consequently, we would expect the positive response to tend to
generalize to the negative class as well, if both classes are represented
with equal frequency in the training sequence.
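
The asymmetry between the two cases is easy to see with numbers. The
following sketch evaluates the four generalization terms under the
product rule for noise intersections discussed above. The function name
`generalization_terms` and the values of Q_x and Q_n are hypothetical,
chosen only so that Q_x exceeds Q_n as the argument assumes.

```python
def generalization_terms(q_x: float, q_n: float) -> dict:
    """Generalization terms for a low-density noisy environment."""
    return {
        "positive_to_positive": q_x,     # Q_xx' is close to Q_x
        "noise_to_positive": q_n * q_x,  # Q_nx = Q_n * Q_x
        "noise_to_noise": q_n * q_n,     # Q_nn' = Q_n squared
        "positive_to_noise": q_n * q_x,  # Q_nx again
    }

# Hypothetical values with the positive stimulus covering the larger area.
terms = generalization_terms(q_x=0.20, q_n=0.05)
```

For a positive test stimulus the correct (positive) term dominates by a
factor of 1/Q_n. For a noise test stimulus the incorrect (positive) term
is the larger one, by a factor of Q_x/Q_n.
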


A slight modification of the perceptron should improve its 
capability of distinguishing negative stimuli from positive ones. If the 
R-unit is given a threshold greater than zero, it will tend to remain "off" 
for the relatively weak signals coming from noise stimuli, but will go "on" 
(to its positive state) for the stronger signals coming from positive stimuli. 
With this modification, however, the system is no longer an elementary 
perceptron. An alternative procedure, which will improve the performance
of an elementary perceptron, is to "overtrain" the negative stimuli,
composing a stimulus sequence in which negative stimuli occur more 
frequently than positive ones. In an error correction experiment, it 
should be noted, this bias will be introduced automatically, regardless of 
the stimulus sequence, so that a detection problem should be solved much 


more readily than with an S-controlled system. 


7.4.2 Detection in Organized Environments


In an "organized environment", where the background material
may closely resemble the stimulus pattern in its characteristics, detection
experiments take on some characteristics of special interest,
psychologically. First of all, it should be noted that in attempting to
distinguish a pattern such as the letter "X" against a background of lines
occurring in random configurations, the environment may include stimuli
which are fundamentally ambiguous in character, since patterns closely
resembling the letter "X", or even identical to it, might arise by a chance
superimposition of straight lines. In such a case, the only reasonable test of
whether or not a pattern should be identified as an "X" would seem to be
the human criterion of whether it looks more like an X or more like a
random assemblage of line segments. While a similar problem might
arise, in principle, in the case of detection experiments in noisy fields, it
is less common there, except under extreme noise conditions. In the case
of organized fields, ambiguous organizations are more the rule of the day,
and the problem requires a different approach. In human perception, the
properties of "good figure" are generally used to determine whether a
particular set of line segments is seen as a letter, or some other known
pattern, or simply as a random collection of unrelated components. Such
judgements are not possible, however, for elementary perceptrons. We
will return to the problem of figural organization in Part IV.


Treating the detection experiment simply as a special case of 
a discrimination experiment, the same conclusions apply as in the case 


of the noisy environment problem: it is possible, by exhaustively training 
the perceptron with the product set of positive stimuli and irrelevant
patterns, to teach it to identify positive stimuli amidst extraneous material.
The learning is apt to be slow, however, and will generally fall considerably 


short of what might be expected in a simpler discrimination experiment. 


Most of the experimental work done to date on detection
experiments has been carried out with the Mark I perceptron using a gamma
system for the memory dynamics. This work will be reviewed in the following
chapter, which deals with γ-perceptrons, but similar results might
be expected with alpha systems.


7.5 Generalization Experiments


In the preceding experiments, it has been required that S_x
should necessarily occur as one of the stimuli in the training sequence.
When the perceptron is tested with a stimulus which has not been previously
seen, a weak form of generalization is possible with elementary α-systems.
Clearly, if the intersection of S_x with some other stimulus in the same class,
S_x', which did occur in the training sequence, is large enough, S_x will
tend to evoke the same response as S_x'. In this case, S_x is correctly
recognized only because, within the limits of tolerance of the perceptron,
it appears to be identical to, rather than merely similar to, the previously
seen training stimulus. Thus, generalization, for an elementary α-perceptron,
is based on an approximation to identity, rather than on similarity. In a
"pure generalization" experiment, as defined in Chapter 3, the perceptron
would be asked to recognize a pattern in a position where it does not
overlap any previously seen patterns of the same class. If such an
experiment is performed with an α-system, with a single class of
stimuli, the generalization will tend to be positive, due to the fact that Q_{ij}
is never zero, for most systems, regardless of the relative positions of
the stimuli. This result is trivial, however, and of no psychological interest,
since any stimulus, whether it resembles the trained stimuli or not, will also
tend to evoke the same response. To prevent such a trivial result, it is
necessary to employ a discrimination test, training the system with two
kinds of stimuli, and then testing it with similar stimuli in a disjoint portion
of the retina to find out whether the appropriate responses have generalized
for both kinds of stimuli. In this case, if the stimuli are of equal area, and
equally trained, no generalization will be found, since the positive
generalization from one class is exactly balanced by the negative generalization
from the other class. Thus it is clear that an elementary α-system (and,
in fact, any elementary perceptron) is incapable of abstracting similarity
(in either the geometric or the psychological sense) but discriminates only
by measuring a function of the overlaps of a test stimulus with representatives
of both classes.


7.6 Summary of Capabilities of Elementary α-perceptrons


The elementary α-perceptrons, being the simplest class
of perceptrons, provide a baseline of performance against which other
systems can be compared. It has been demonstrated that the α-system,
with both S-controlled and error correction reinforcement, is capable of
discrimination learning, provided it sees a large representative sample of
the stimuli which it is required to discriminate. It does not generalize
well to similar forms occurring in new positions in the retinal field, and
its performance in detection experiments, where a familiar figure appears
against an unfamiliar background, is apt to be weak. More sophisticated
psychological capabilities, which depend on the recognition of topological
properties of the stimulus field, or on abstract relations between the
components of a complex image, are lacking. The elementary perceptron
has no capability of recognizing time sequences, since its responses are
based on the momentary state of the system due to the current stimulus
pattern alone, and are not influenced by the preceding sequence of events.
Quantitative judgement might possibly be learned by an exhaustive training
procedure, in which the system is required to give one response for
stimuli above a certain area, or over a certain length, for example, and
an opposite response if they fall short of the criterion. This is a rather
crude approximation to quantitative estimation, however, and the problem
can be handled much more satisfactorily with perceptrons with linearly
responding R-units, as will be seen in Chapter 10. In R-controlled
experiments, where the perceptron is required to form its own classification
of stimuli, we have seen that the elementary α-perceptron tends either
to classify everything identically (its most general tendency) or else to
form a random dichotomy, which is of no psychological interest. It will
be found that most of the weaknesses of elementary α-perceptrons are
true of all simple perceptrons, and that it is necessary to go to topologically
more complicated systems to find performances which are basically more
satisfactory. In special cases, however, other types of simple perceptrons
have advantages, as will be seen in the following chapters.


7.7 Functionally Equivalent Systems


It may be disturbing to some biologically oriented readers to 
think of an association unit that changes the sign of its output signal from 
excitatory to inhibitory as a function of its training. This is a conceptual 
simplification which makes analysis easier, but can be shown to be logically 
equivalent to an alternative model in which particular neurons, or A-units, 
are designated as excitatory, and others as inhibitory, with no change 
permitted in the sign of their outputs. The alternative model (which is 
analogous to the models originally presented in Refs. 79 and 80) is as 


follows: 


Let the number of A-units be twice the number in the equivalent
α-perceptron. Let half of the A-units be designated as excitatory units,
and the other half be inhibitory units. All v_i are initially assumed to be
zero, or else to have positive signs if a_i is excitatory, negative signs if
a_i is inhibitory. Each excitatory unit is paired with one of the inhibitory
units, and the same origin point configuration is assigned to both members
of the pair. Thus the responses of the inhibitory units exactly duplicate
the responses of the excitatory units. The reinforcement rule is that a
positive η from the r.c.s. affects only the excitatory units, while a
negative η affects only the inhibitory units. With this rule, the signal
u_i which goes to the R-unit in response to S_i is the sum of an
excitatory component and an inhibitory component, the total being exactly
equal to what it would be in the equivalent α-perceptron.
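
The equivalence can be verified on a toy example. In the sketch below the
function names, the four-unit activity vectors, and the training sequence
are all hypothetical. Values start at zero, and each reinforcement is a
signed η applied to the units active for the current stimulus.

```python
def alpha_reinforce(values, activity, eta):
    """Alpha-perceptron: a signed eta is added to every active unit."""
    return [v + eta * a for v, a in zip(values, activity)]

def paired_reinforce(exc, inh, activity, eta):
    """Paired model: positive eta goes to excitatory units only,
    negative eta to inhibitory units only."""
    if eta > 0:
        exc = [v + eta * a for v, a in zip(exc, activity)]
    else:
        inh = [v + eta * a for v, a in zip(inh, activity)]
    return exc, inh

def signal(values, activity):
    """Signal reaching the R-unit from the active units."""
    return sum(v * a for v, a in zip(values, activity))

# Hypothetical training sequence: (A-unit activity, signed eta).
training = [([1, 0, 1, 0], +1), ([0, 1, 1, 0], -1),
            ([1, 1, 0, 1], -1), ([0, 0, 1, 1], +1)]

values, exc, inh = [0] * 4, [0] * 4, [0] * 4
for activity, eta in training:
    values = alpha_reinforce(values, activity, eta)
    exc, inh = paired_reinforce(exc, inh, activity, eta)

test_activity = [1, 1, 1, 0]
u_alpha = signal(values, test_activity)
u_paired = signal(exc, test_activity) + signal(inh, test_activity)
```

The two signals agree, while the excitatory values never become negative
and the inhibitory values never become positive, so no unit has to change
the sign of its output.
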


The exact pairing of the excitatory and inhibitory units is, of
course, an inessential artifact, introduced only to guarantee that the two
types of systems are truly identical in performance. If the origin
configurations of all units are selected independently of one another, the
expected values of the signals will be unaffected, but the variability will be
somewhat increased, due to the greater number of independent A-units
contributing to the signal. Such a system has been previously described as
a "differentiated A-system" (Ref. 79).


8. PERFORMANCE OF ELEMENTARY γ-PERCEPTRONS IN
PSYCHOLOGICAL EXPERIMENTS


It will be recalled that the reinforcement rule for a gamma
system (defined in Chapter 4, Def. 38) is one which guarantees that the
sum total of the value of all connections to any unit remains constant, even
though the values of individual connections may change with time. In the
notation of the last chapter, the change in the value of the connection c_{ir},
due to the reinforcement of stimulus S_j, was given by


Av;, = 2; a; (i) for an oac-system. (8.1) 
For a gamma system, the corresponding expression is 
¥,, f ; 
Av;p = 2; [af (i) - 7 >. agli ) 
a | (8.2) 


A variation of the gamma system, which will be designated the γ'-system,
is of interest chiefly because it is considerably easier to analyze. For this
model,


An, = 2;[aj (i) - @;] 
(8.3) 


This is equal to the expected value of Δv_{ir} for the γ-system, and
with large values of N_a, the γ-system and γ'-system become
indistinguishable.
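
The difference between the three reinforcement rules can be made concrete
with a short numerical sketch. The update forms below are assumptions
written to match the verbal description: the alpha rule adds η to each
active connection, the gamma rule subtracts the mean activity so that the
total value is conserved, and the γ' rule subtracts the expected activity
Q in place of the observed mean. The function names and the values of
N_a, Q and η are hypothetical.

```python
import random

def alpha_step(values, activity, eta):
    """Alpha rule: every active connection gains eta."""
    return [v + eta * a for v, a in zip(values, activity)]

def gamma_step(values, activity, eta):
    """Gamma rule: subtract the mean activity, conserving the total value."""
    mean_activity = sum(activity) / len(activity)
    return [v + eta * (a - mean_activity) for v, a in zip(values, activity)]

def gamma_prime_step(values, activity, eta, q):
    """Gamma-prime rule: subtract the expected activity q instead."""
    return [v + eta * (a - q) for v, a in zip(values, activity)]

# Hypothetical example: 1000 A-units, each active with probability 0.2.
rng = random.Random(0)
n_a, q, eta = 1000, 0.2, 1.0
activity = [1 if rng.random() < q else 0 for _ in range(n_a)]
start = [0.0] * n_a

total_alpha = sum(alpha_step(start, activity, eta))
total_gamma = sum(gamma_step(start, activity, eta))
total_gamma_prime = sum(gamma_prime_step(start, activity, eta, q))
```

Under these assumptions the gamma total stays at zero up to rounding
error, the alpha total grows by the number of active units, and the γ'
total drifts only by the chance deviation of the active count from its
expectation, which becomes negligible relative to N_a as N_a grows.
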


The organization of this chapter will follow closely that
of Chapter 7. The first section deals with the analysis of discrimination
experiments with S-controlled reinforcement, and presents results of a
number of experiments, including comparisons with the α-systems
considered in the last chapter. Discrimination experiments with error
correction, and discrimination experiments with R-controlled reinforcement
are then presented, and the final sections deal with detection
experiments, and other performances of γ-perceptrons.


8.1. Discrimination Experiments with S-controlled Reinforcement 


8.1.1 Fixed Sequence Experiments: Analysis 


As in the case of the alpha-system analysis, our object is to
compute the ratio E(u_x)/σ(u_x), for the class of perceptrons, test
stimulus, and training sequence under consideration. The notation and
definitions correspond to those employed in Chapter 7. The analysis again
follows that of Joseph (Ref. 41). For the γ-system, the expected value
of u_x is obtained as follows: The value of the connection from the A-unit
a_i at the end of the training sequence is given by:


toed f x, 
TD 2; 2; Jez) “ Zi ay Wi) 
J 


Vir 


N,A,-f / e,, 
7) OP; [ta av(j)-s- Lae | 
J 


a kt 


Consequently, if the test stimulus S_x is now shown, the input to the
response unit will be


by = TD, > 2; P; ES : a} (ix) = rs aye) 2 23(4) | 
god . 


yielding, for the expected value of the signal u_x,


No! / 
elon) = TEE Hes |B Rie Fe EY 


= T(Na-1) 2,2; #3 (Qjx~ QA) 
(8.4) 


For a γ'-system, the analysis is considerably simplified. In this case,
the value of the connection from unit a_i at the end of the training sequence
is

Vir = rh A; t; Lar(s) 7 @] 


Collecting the signals from all active connections when S_x occurs yields


the input to the R-unit, 


a Sia TD, D PP; aj (ix) -Q; a (x)| 
é od 


and the expected value of this signal is 


E(uy) = TNg 2.4 Py (Ox - Q; Qx) 
J (8.5) 


The variance of u_x is again computed from the general
equation (7.4), given in the last chapter. For a γ'-system, the same
considerations apply as in the α-system, namely, that the only source
of variability in the signals c*_{ir}(x) is the origin point configurations
of the A-units, which are selected independently for the different A-units.
Consequently, the equation (7.5) holds identically for a γ'-system. In a
true γ-system, however, the signals c*_{ir}(x) are not independent. The
value v_{ir} upon which c*_{ir} depends is the result of a series of increments,
Δv_{ir}, each of which depends upon the particular set of A-units which are
active at the time of reinforcement (as shown in Equation 8.2). Consequently,
for a gamma system, the variance is
o *(uy) - NN a” [<p (x)) +N, (Na-1) cov. l<i- (1), £35 x) | 


= [E28 (x) - eo (x)| + N.(N,-!) [E cf, (2) £'p (x) 
- Bip (2) E c3'7 a) | (8.6) 


The reader who is interested in the detailed analysis of this expression 
will find a full algebraic expansion of its components in Ref. 41. The 


final equation which results is as follows: 


oXu,) - rN y (Na- 1-20) [(Q: te - % MM Cx) 

om Na ; A FPA Pi PA) Na Cx) CG An Qj Oe Ox 
— 204 (Qig- 2 Qx)|- Ma-2) |(Qjx~ G Ox) Ree I~ 2 (QGa- G ad 
+ 0x(Ga-G 2a} 


(8.7) 
An analogous treatment for the γ′-system, based on Equation (7.5), yields the expression:

$$\sigma^2(u_x) = \eta^2 N_a \sum_j \sum_k \rho_j \rho_k f_j f_k \Big[ (Q_{jkx} - Q_j Q_k Q_x) - 2 Q_k (Q_{jx} - Q_j Q_x) - (Q_{jx} - Q_j Q_x)(Q_{kx} - Q_k Q_x) \Big] \tag{8.8}$$

For both the γ-perceptron and the γ′-perceptron, the expectation of u_x and the variance of u_x are both on the order of N_a. Consequently, the ratio E(u_x)/σ(u_x), in which the mean grows as N_a and the standard deviation only as its square root, can be made arbitrarily large by increasing N_a. This means that the theorem stated in the last chapter (Page 159) holds for γ and γ′-perceptrons as well as for α-systems. Equation (7.7) can again be used for a close approximation to the actual probability of correct response for a γ or γ′-perceptron, substituting the appropriate expressions for the mean and variance in each case.
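The square-root growth of the ratio can be illustrated numerically for the γ′-system. The following is a minimal sketch, assuming the mean has the form η N_a Σ_j ρ_j f_j (Q_jx − Q_j Q_x) and the variance the double-sum form of Equation (8.8); the function names and the numerical Q values (a single training stimulus with Q_j = Q_x = 0.2, Q_jx = 0.1) are hypothetical and chosen only for illustration.

```python
import math

def gamma_prime_mean(eta, n_a, rho, f, q, q_x, q_jx):
    """Expected signal E(u_x) for a gamma-prime system, fixed training sequence."""
    return eta * n_a * sum(
        r * w * (qjx - qj * q_x) for r, w, qj, qjx in zip(rho, f, q, q_jx)
    )

def gamma_prime_variance(eta, n_a, rho, f, q, q_x, q_jx, q_jkx):
    """Variance of u_x for a gamma-prime system (double sum over stimuli j, k)."""
    total = 0.0
    for j in range(len(rho)):
        for k in range(len(rho)):
            dj = q_jx[j] - q[j] * q_x
            dk = q_jx[k] - q[k] * q_x
            term = (q_jkx[j][k] - q[j] * q[k] * q_x) - 2 * q[k] * dj - dj * dk
            total += rho[j] * rho[k] * f[j] * f[k] * term
    return eta ** 2 * n_a * total

# Hypothetical single-stimulus case; for j = k, Q_jkx reduces to Q_jx.
PARAMS = dict(rho=[1], f=[1], q=[0.2], q_x=0.2, q_jx=[0.1])

def ratio(n_a):
    mean = gamma_prime_mean(1.0, n_a, **PARAMS)
    var = gamma_prime_variance(1.0, n_a, q_jkx=[[0.1]], **PARAMS)
    return mean / math.sqrt(var)

for n_a in (100, 400, 1600):
    print(n_a, round(ratio(n_a), 3))
```

Quadrupling N_a doubles the ratio in this example, which is the behavior the theorem relies on.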


It is interesting to note that if the expected values of the generalization coefficients, g_ij, are substituted into equations (7.3), (8.4), and (8.5), identical expressions are obtained for the expectation of u_x for the α, γ, and γ′-systems. The expected value of the un-normalized coefficient, g_ij, for a γ-perceptron is (N_a − 1)(Q_ij − Q_i Q_j); for a γ′-perceptron it is N_a(Q_ij − Q_i Q_j), while for an α-perceptron it is N_a Q_ij. Substituting these quantities, we obtain, for all three systems,


$$E(u_x) = \eta \sum_j \rho_j f_j\, E(g_{xj}) \tag{8.9}$$
a, = 7) A; 9x; (8.10) 
J 
The special properties of the γ and γ′-perceptrons are due to the fact that their generalization coefficients for a binomial model tend to be negative for sufficiently well separated, or disjoint, stimuli, whereas in the case of an α-system, the generalization coefficients are all non-negative. In a Poisson model, while it is possible for negative generalization coefficients to occur due to random variability of individual perceptrons, the expected values of g_ij are always non-negative, since Q_ij = Q_i Q_j only if the stimuli are disjoint. These features are of interest for R-controlled experiments, as will be seen presently.
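The sign behavior described above follows directly from the expected un-normalized coefficients quoted earlier: N_a Q_ij for the α-system, (N_a − 1)(Q_ij − Q_i Q_j) for the γ-system, and N_a(Q_ij − Q_i Q_j) for the γ′-system. The sketch below simply evaluates these three expressions; the function name and the numerical Q values are hypothetical, the second case standing in for a pair of well separated stimuli in a binomial model, where Q_ij falls below Q_i Q_j.

```python
def expected_g(system, n_a, q_i, q_j, q_ij):
    """Expected un-normalized generalization coefficient for one stimulus pair."""
    if system == "alpha":
        return n_a * q_ij
    if system == "gamma":
        return (n_a - 1) * (q_ij - q_i * q_j)
    if system == "gamma_prime":
        return n_a * (q_ij - q_i * q_j)
    raise ValueError(f"unknown system: {system}")

# Hypothetical values: strongly overlapping pair vs. well separated pair.
CASES = {"overlapping": 0.05, "well separated": 0.004}
for label, q_ij in CASES.items():
    row = {s: expected_g(s, 100, 0.1, 0.1, q_ij)
           for s in ("alpha", "gamma", "gamma_prime")}
    print(label, row)
```

With these numbers the α coefficient stays non-negative in both cases, while the γ and γ′ coefficients change sign for the well separated pair.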


8.1.2 Fixed Sequence Experiments: Examples 


Numerical analyses have been carried out mainly for the γ′-perceptrons, since the equations are considerably simpler. For large values of N_a, the γ′ and γ-systems will have identical performances. Tables 3 and 4 (in Chapter 7) apply identically to the γ′-system, for Experiments 1 and 2. The performance curves shown in Figures 13 and 14 are also applicable. Figure 21 shows a comparison of the γ and α-systems on Experiment 1 (horizontal vs. vertical bar discrimination), for the optimum parameters with a binomial model (x = 3, y = 1, θ = 2). Figure 22 shows a similar comparison for the same parameters, with Experiment 2.

It is clear that under the conditions of Experiments 1 and 2, the γ-systems have no advantage over the α-perceptrons. The equivalence of the curves is due to the fact that in these experiments, all stimuli are equal in area (yielding equal Q_i for all stimuli), the number of stimuli in each class is equal, and all stimuli occur with equal frequency. If the sizes or frequencies are unequal, the γ-system may have a marked advantage, as will be seen in the analysis of Experiment 4, in Section 8.1.4.
8.1.3 Random Sequence Experiments: Analysis

The un-normalized generalization coefficients for a γ and γ′-system are given by

$$g_{ij} = n_{ij} - \frac{n_i n_j}{N_a} \quad \text{for a } \gamma\text{-system} \tag{8.11}$$

$$g_{ij} = n_{ij} - Q_i n_j \quad \text{for a } \gamma'\text{-system} \tag{8.12}$$

where n_ij = the number of A-units responding both to S_i and to S_j.

As in the α-system analysis (Section 7.1.4) the training sequence is assumed to consist of T stimuli, where each stimulus, S_j, has a probability P_j of being selected at any step of the training sequence. The analysis has been carried out only for the γ′-perceptron, since the true γ-system leads to excessively cumbersome expressions for the variance. For large N_a, as observed in the preceding section, the two systems should be virtually indistinguishable in performance.
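For a particular perceptron, the un-normalized coefficients can be computed directly from the sets of A-units responding to each stimulus. The following sketch assumes the forms g_ij = n_ij − n_i n_j / N_a for the γ-system and g_ij = n_ij − Q_i n_j for the γ′-system; the function names and the ten-unit example are hypothetical, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

def g_gamma(resp_i, resp_j, n_a):
    """Un-normalized gamma coefficient from the responding sets of S_i and S_j."""
    n_ij = len(resp_i & resp_j)
    return Fraction(n_ij) - Fraction(len(resp_i) * len(resp_j), n_a)

def g_gamma_prime(resp_i, resp_j, q_i):
    """Un-normalized gamma-prime coefficient; q_i is the expected activity Q_i."""
    n_ij = len(resp_i & resp_j)
    return Fraction(n_ij) - q_i * len(resp_j)

# Hypothetical perceptron with N_a = 10 association units.
N_A = 10
S_I = {0, 1, 2, 3}      # units responding to S_i
S_J = {2, 3, 4}         # overlaps S_i in two units
S_K = {7, 8}            # disjoint from S_i

print(g_gamma(S_I, S_J, N_A), g_gamma(S_I, S_K, N_A))
print(g_gamma_prime(S_I, S_J, Fraction(2, 5)))
```

The disjoint pair receives a negative coefficient, while the overlapping pair receives a positive one, in line with the discussion of well separated stimuli.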


For the γ′-system, the input to the response unit when S_x occurs after the training sequence is

$$u_x = \sum_j \rho_j m_j (n_{xj} - Q_j n_x)$$

where m_j, as before, is the number of times that S_j occurs in the training sequence. Taking the expected value of this expression, we obtain

$$E(u_x) = T N_a \sum_j \rho_j P_j (Q_{jx} - Q_j Q_x) \tag{8.13}$$
The variance of u_x over both perceptrons and training sequences is again given by equation (7.10). In the present case, this yields:


oy) = TN, De P; | 2;2- 20; Dye0 oO; + (Na-1)(Qy- Q. a)? | 
J 


es eT nate [(r- 1) (Qiax 9; 94x Oa Qe +O; OO) 
J 


~ (THN, - CO ry ~ 2 Oy Og x % 2,)] 
(8.14) 


The detailed derivation of this expression can be found in Ref. 41. It can 
readily be seen that the theorem of Section 7.1.2 continues to hold for this 


system. Actual performances can again be calculated by using Equation (7.7). 


8.1.4 Random Sequence Experiments: Examples 


A comparison of binomial α and γ′-perceptrons on the random sequence version of the horizontal/vertical bar experiment (Experiment 3) is shown in Figure 23. A curve obtained from the simulation of a true γ-system with the same parameters is included for comparison. The simulation curve shows the average of 100 runs. Figure 24 compares the performance of the binomial model with that of a Poisson model, on the same experiment.

In Figure 25, the performance of a γ′-system in the "frequency bias" experiment (Experiment 4) is shown, with the mean performance curve of the equivalent α-system, from Figure 14, included for comparison. A comparison with Figure 16 shows that under conditions of unequal frequency for the two classes to be discriminated, the γ-system may have a marked advantage. The effect of frequency bias on a γ-system is also shown in a number of simulation experiments with the IBM 704 computer, which have been described previously (Ref. 84). The horizontal/vertical bar discrimination problem happens to show up the γ-system to its best advantage, since, with a binomial perceptron, the expected value of the generalization coefficient, g_ij, where S_i and S_j are in opposite classes, is zero for this particular problem. A Poisson model, where the interaction between the horizontal and vertical bar classes is non-zero, would not perform as well in this experiment, and the binomial model would also perform less well in experiments with classes of stimuli which could achieve greater intersections.


Figures 26, 27 and 28 show some typical experiments performed with a digital simulation program, for binomial γ-perceptrons of sizes up to N_a = 1000, and a 72 by 72 retina. The stimuli are kept within the retinal field in these experiments by requiring that their centers remain within a 13 by 13 field, so that there are 169 possible positions for each stimulus. In Figure 26(b), the effect of allowing rotations up to 30 degrees and up to 359 degrees (inclusive), in addition to displacements within the retinal field, is illustrated. Figure 28 shows the effect of size bias where one class of stimuli (the letter "F") can be considered as subsets (on the retina) of stimuli of the other class (the letter "E"). With purely excitatory connections from the retina, the situation is clearly much worse than with both excitatory and inhibitory connections, as shown in Figures 28(a) and (b).


From the equations for the expected value of the signal (Equation 8.13, for example) it can be seen that a bias in the correct direction may exist even when the perceptron is occasionally reinforced in the wrong direction. Several experiments have been carried out by Hay using the Mark I perceptron at CAL, to study the effect of "random errors" by the experimenter training the machine (Ref. 30). In an experiment on the discrimination of the letters "E" and "X" with a γ-perceptron employing S-controlled learning, it was found that the perceptron learned to discriminate the letters with 100% accuracy despite the introduction of 30% misidentifications by the experimenter (i.e., by the r.c.s.). This experiment emphasizes the fact that the perceptron can exceed the level of performance of its "teacher" or reinforcement control system.

Figure 27 SQUARE-DIAMOND DISCRIMINATION. N_a = 1000, x = 10, y = 0, θ = 4. CENTERS PLACED IN 13 x 13 FIELD

Figure 28 "E" VS. "F" DISCRIMINATION. CENTERS PLACED IN 13 x 13 FIELD. ABSCISSA: TIME (NO. OF STIMULI)
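The tolerance for teacher errors can be made concrete with a short calculation. The sketch below rests on an assumption that is not part of the experiment described: each reinforcement sign is reversed independently with a fixed probability ε, regardless of the stimulus. Under that assumption the expected sign of reinforcement for S_j becomes (1 − ε)ρ_j + ε(−ρ_j) = (1 − 2ε)ρ_j, and since the expected signal is linear in the ρ_j, it is scaled by the same factor. The function name is hypothetical.

```python
def expected_signal_with_errors(clean_expected_signal, error_rate):
    """Expected signal when each reinforcement sign is flipped independently.

    Assumes the flip probability `error_rate` is the same for every stimulus,
    so that the error-free expected signal is scaled by (1 - 2 * error_rate).
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    return (1.0 - 2.0 * error_rate) * clean_expected_signal

# With 30% misidentifications the expected signal keeps its sign
# but retains only 40% of its error-free magnitude.
for eps in (0.0, 0.3, 0.5):
    print(eps, expected_signal_with_errors(10.0, eps))
```

Only the mean is treated here; the variance of the signal is not reduced by the errors, so the ratio of mean to standard deviation falls and a larger N_a or a longer training sequence is needed for the same reliability.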


8.2 Discrimination Experiments with Error-Corrective Reinforcement 


While it has been demonstrated in Chapter 5 (Theorem 8) that the error correction procedure will not always lead to a solution with the γ-system, practical systems seem to work about as well as α-systems, and may actually learn somewhat faster in some cases. Figures 29 and 30 illustrate two sets of experiments on γ-perceptrons, using the Burroughs 220 computer at Cornell University, in which performance is compared with perceptrons having the same topological organizations, but employing an α-system memory rule. Since the error correction procedure will lead to a solution regardless of sequence or relative frequency of stimuli in the classes being discriminated, and regardless of relative sizes of stimuli, the special advantages of the γ-system in overcoming frequency bias and size bias are relatively unimportant here. In most experiments with error-corrective reinforcement, therefore, the simpler α-rule is generally employed.


Figure 29 COMPARISON OF BINOMIAL α AND γ SYSTEMS IN HORIZONTAL/VERTICAL BAR DISCRIMINATION WITH ERROR CORRECTION (EXPT. 6). MEAN OF 25 CASES. ABSCISSA: NO. OF TRAINING STIMULI

Figure 30 COMPARISON OF α AND γ SYSTEMS IN SQUARE/TRIANGLE DISCRIMINATION WITH ERROR CORRECTION. MEAN OF 5 CASES. ABSCISSA: NO. OF TRAINING STIMULI


8.3 Discrimination Experiments with R-controlled Reinforcement 


The performance of a γ-perceptron in R-controlled experiments (where the r.c.s. is entirely isolated from the environment and reinforces the perceptron positively at all times, regardless of what its current response happens to be) is somewhat more interesting than that of the α-perceptron. Since it is possible to have negative generalization coefficients for the γ-model, two distinct possibilities suggest themselves which were not present before: (1) The system may form an unstable classification of the environment, with individual stimuli continually shifting membership from one class to the other, due to negative interaction between successive reinforcements; (2) the system may form a stable dichotomy with some stimuli in the positive class and some in the negative class. The third possibility corresponds to the expected situation with an α-system, namely: (3) The system may form a stable classification with every stimulus in the same class, the alternative class being empty.


An unpublished theorem by H. Kesten* proves that (for a γ-system in which the values are allowed to grow without bound) the first alternative is impossible. Every perceptron will ultimately form a "stable" classification, in which every stimulus is assigned to one of the two classes and will remain in the same class with probability 1 at any future time. The remaining two alternatives both remain possible, however.

At the present time, a fully satisfactory analysis of the classification tendencies of γ-perceptrons which are "left on their own" in an R-controlled experiment is not available. A number of special cases can be analyzed heuristically, however, and some of these are illuminating. Moreover, a series of simulation experiments has been completed which illustrates performance on some typical problems.

* Personal communication.


The basic feature of this system in an R-controlled experiment is a tendency to classify stimuli on the basis of retinal location, rather than geometrical similarity. If two stimuli occur in the same location on the retina, covering largely the same set of sensory points, g_ij will tend to be positive, so that the reinforcement of one stimulus will tend to generalize automatically to the other. A "cluster" of such stimuli, projected onto a limited region of the field, will tend to be classified the same way, either all positive or all negative. On the other hand, two stimuli which cover disjoint sensory sets will (in a binomial model) tend to have a negative g_ij. In this case, reinforcing S_i positively will automatically assign S_j to the negative class, if its value was previously zero. Thus, clusters of stimuli which are "well separated" will tend to go into opposite classes, with a binomial γ-perceptron. The following experiment illustrates this tendency quite clearly:


EXPERIMENT 9: For the same retina and environment of horizontal and vertical bars described in Experiment 1, let the stimuli occur in a random sequence, as in Experiment 3. During the training sequence, R-controlled reinforcement is employed. The response to each of the 40 bars is then determined, to establish the classification which has been developed by the perceptron.

In a Poisson model, the expectation of g_ij for disjoint stimuli is zero, in the γ-system, and all stimuli will tend to go into the same class unless they form completely disjoint clusters, in which case the class assignment will be random for each cluster.
In a number of repetitions of this experiment (which was simulated with a 704 computer for a very large, or infinite N_a, binomial perceptron), it was found in every case that the perceptron placed ten adjacently located horizontal bars and ten adjacent vertical bars in the positive class, and the other ten bars of each type in the negative class. The dynamics of the process can be readily followed in a heuristic fashion. The first bar to be seen -- say a vertical bar -- may evoke a positive or negative response at random. If r* = +1, then the connections from the responding A-units will each gain a positive increment of value, and connections from inactive A-units will become slightly negative, so that the total value is conserved. For two disjoint bars in the "same" class (i.e., both horizontal or both vertical) g_ij will be negative, but for the two closest neighbors on either side, g_ij will be positive. The generalization, g_ij, to members of the "opposite" class (i.e., one horizontal and one vertical) will be zero, since the intersection between any horizontal and vertical bar, in this environment, is equal to its expected value, yielding zero generalization for a binomial γ-system (see Page 146). Consequently, the horizontal and vertical bars will never interact, regardless of the sequence in which they occur, and each of these two sets of stimuli will organize independently. Consider, therefore, the development of a classification for the vertical bars, after the first has been associated to r* = +1. If the second vertical bar in the training sequence should happen to be one of the two close neighbors on either side of the original bar, this will immediately evoke the response r* = +1, and will be reinforced in the same direction as the previous bar, extending the net positive generalization to at least one additional member of the vertical set. At the same time, vertical bars which are more than two positions removed from both of the bars already seen will now have twice the negative reinforcement that they received before, due to the summation of the negative g_ij. If one of these bars should occur, the response will be -1 and the reinforcement will be negative. This will not only spread negative value to the adjacent stimuli, but will add to the positive value of the stimuli which were previously placed in the positive class. Thus two mutually supporting "nuclei" of stimuli are formed, one in the positive class and one in the negative class, which tend to spread their domain to neighboring stimuli, but tend to "repel" remote stimuli, supporting their adhesion to the opposite class. Under these conditions, it is plausible that the most stable balance between classes will be found when the classes are evenly divided, each tending to attract marginal stimuli from the other to the same degree.


Simulation experiments with this procedure show that a stable 
dichotomy tends to be formed after the first few hundred stimuli of the 
training sequence, the probability of a change in class membership being 
very small thereafter. The terminal condition is of the type indicated above, 


with 10 horizontal and 10 vertical bars in each class of the dichotomy. 
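The heuristic account of Experiment 9 can be explored with a toy simulation of one set of parallel bars. The sketch below is not the simulation program referred to in the text, and several of its ingredients are assumptions made here: twenty bar positions arranged cyclically, bars four positions wide, and a generalization coefficient proportional to the overlap of two bars minus the mean overlap (scaled to integers as 5 × overlap − 4, so that the coefficients from any one bar sum to zero). Only the vertical set is simulated, since the two sets are said not to interact.

```python
import random

N_BARS = 20      # assumed number of positions for one orientation
BAR_WIDTH = 4    # assumed bar width, in retinal rows or columns

def circ_dist(a, b):
    """Distance between two bar positions on an assumed cyclic retina."""
    d = abs(a - b) % N_BARS
    return min(d, N_BARS - d)

def toy_g(d):
    """Assumed coefficient: 5 * overlap - 4 (overlap minus mean overlap, scaled)."""
    overlap = max(0, BAR_WIDTH - d)
    return 5 * overlap - 4

def run(n_steps, seed):
    """R-controlled training: each bar is reinforced in the direction of its response."""
    rng = random.Random(seed)
    u = [0] * N_BARS                      # signal evoked by each bar
    for _ in range(n_steps):
        j = rng.randrange(N_BARS)         # random training sequence
        if u[j] > 0:
            r = 1
        elif u[j] < 0:
            r = -1
        else:
            r = rng.choice((1, -1))       # zero signal: response at random
        for k in range(N_BARS):
            u[k] += r * toy_g(circ_dist(j, k))
    return u

final = run(500, seed=1)
print("".join("+" if x > 0 else "-" if x < 0 else "0" for x in final))
```

Because the assumed coefficients sum to zero, the total signal is conserved and both classes are necessarily occupied; whether the printed pattern shows two contiguous blocks of ten depends on the assumed coefficients and is not guaranteed by this sketch.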


8.4 Detection Experiments 


In detection experiments, the same general conclusions hold true as in the case of α-systems (Section 7.4). In the case of noisy environments with a large retina, it was noted that the intersection of a noise pattern with any other stimulus will be equal to the expected value of the intersection, i.e., to the product of the measures of the active S-sets. For the binomial γ-system, this implies zero generalization from a reinforced "positive" stimulus to a noise pattern, and zero generalization from one noise pattern to another. This means that a class of positive stimuli can be learned without any generalization to noise patterns, but that negative training on a limited sample of noise patterns does not generalize effectively to new noise patterns. As in the case of the α-system, the use of a threshold greater than zero on the R-units should effectively separate positive stimuli from noise patterns. It is worth noting that for discriminating a single class of positive stimuli from noise, a monopolar reinforcement system (Definition 35, Chapter 4) will work as effectively as a bipolar system, since reinforcement given for negative responses has little or no effect on future performance (except for those noise patterns actually seen, or nearly identical to those seen).


Several experiments have been performed with the Mark I perceptron at CAL to evaluate the performance of γ-perceptrons in noisy environments, and in problems in which positive stimuli such as letters of the alphabet have been mixed with extraneous, but similarly organized stimuli (geometric patterns, other letters, etc.). Performance on the discrimination of the letters "E" and "X" with various amounts of noise present has been reported by Hay in Ref. 30. Two 240 A-unit perceptrons were tested, both learning to perfection in the absence of noise. With noise present, one perceptron learned as well as before, the second falling to about 75% accuracy. The amount of noise introduced was not carefully quantified in these experiments, but it is clear that the perceptron can perform appreciably better than chance as long as a human observer can still detect the original letters embedded in the image. In the experiments with superimposed images of irrelevant patterns, a poorer level of performance is obtained. A perceptron trained to respond positively to the letter X, with monopolar γ-reinforcement, will generally give the proper response whenever an "X" is present, but tends to give the positive response quite frequently to triangles, squares, or other letters as well. The introduction of a high response threshold improves performance considerably, but a system capable of responding in terms of figure-ground organization would clearly have a great advantage in such experiments. As the quantity of background material is increased, the performance of an elementary perceptron in detection experiments deteriorates rapidly.


A striking difference between an elementary perceptron and a human observer in detection experiments is that the human will show vast differences in performance depending upon organizational properties of the background and its relationship to the figure. For example, the human observer will readily recognize the letter "E" in Figure (a), but will find it hard to segregate the "E" from the extraneous lines in Figure (b). An elementary perceptron would show little or no difference between these two situations.


Typical test patterns for detection experiments 
8.5 Generalization and Other Capabilities

In "pure" generalization experiments, where the test stimuli are disjoint from the training stimuli, the γ-system has no advantages over the α-system. In fact, the binomial γ-system, due to its negative g_ij for disjoint stimuli, will actually tend to place a disjoint stimulus in the opposite class from the reinforced stimulus, unless members of the opposite class have also been reinforced, in which case the effects tend to cancel.


Where the training stimuli cover the retina in a representative sample of locations, the gamma system has the possible advantage of low or negative generalization to patterns which have small intersections with the trained patterns. This shows best in such experiments as the horizontal/vertical bar discrimination experiment, where generalization from horizontal to vertical bars is zero. As was noted in the case of R-controlled discrimination experiments, generalization in γ-systems, as with all elementary perceptrons, tends to be based on the location rather than the similarity of the stimuli, in any more fundamental sense. Ideally, we would hope to find a system in which g_ij is large for all pairs of stimuli, S_i and S_j, which are "similar" or "equivalent" under some group of spatial transformations, such as rigid motions, dilatations, or projective transformations, and small or negative otherwise. Except under unusual and highly restrictive environmental conditions, this property is not to be found in elementary perceptrons. Highly artifactual organizations which have the required property can be designed in the case of four-layer series coupled perceptrons, as will be seen in Chapter 15. Systems which spontaneously acquire the required organizational properties are found chiefly among the cross-coupled perceptrons, however, and will be discussed in Part III of this volume.


In general, it is seen that γ-perceptrons have much the same properties as α-systems. In S-controlled experiments, especially with frequency and size bias present, they perform somewhat better, but in error correction experiments there is little to be gained from the gamma rule, and there is the possibility that the γ-system may fail to work where an α-system would have succeeded, as proven in Chapter 5. The performance in R-controlled experiments is somewhat more interesting than that of α-systems, but the classifications which are formed spontaneously tend to be based on the position of stimuli on the retina, rather than similarity, and are consequently of minimum psychological interest.


The γ-system may be somewhat more plausible as a biological memory mechanism, due to its fundamental conservative property. If biological memory is due to a physical process which maintains some over-all equilibrium, such as a chemical substance the total amount of which remains invariant, or a competition among afferent processes for "Lebensraum" in the neighborhood of an efferent neuron, this property would certainly be indicated. It should be emphasized, however, that the conservation of the total value, as in the systems considered in this chapter, is insufficient to keep individual coupling coefficients, v_ij, from becoming indefinitely great, since they may be balanced by negative values of equal magnitude. Such a condition is quite implausible in any real physical system. In the next chapter, elementary perceptrons with memory dynamics which limit the growth of the values are considered.
9. ELEMENTARY PERCEPTRONS WITH LIMITED VALUES

Two basically different mechanisms for limiting the growth of values, v_ij, will be considered in this chapter. The first mechanism is a simple upper and lower bound, such that the value may grow up to the designated limit but no further. Systems employing this mechanism show "saturation properties" as the connections attain their limits. The second mechanism is an exponential decay, which determines an equilibrium point for each v_ij depending upon the frequency with which it is reinforced. If the decay rate is very small, such systems tend to approach a terminal state resembling the performance characteristics of a perceptron with unlimited values after a long training sequence. Systems with strictly bounded values will be considered first.


9.1 Analysis of Systems with Bounded Values 


Two types of analysis have been carried out for systems having upper and lower bounds for v_ir. The first deals with the terminal distribution of the values after a long period of exposure to a random sequence of stimuli, with S-controlled reinforcement. The second deals with the actual performance of a bounded-value perceptron. In both cases, we will follow the method of analysis originally employed by Joseph, in connection with bounded γ-perceptrons (Ref. 41)*. All of these analytic results apply to experimental systems using S-controlled reinforcement procedures.

* Bounded γ-systems have been called A-systems in Ref. 41.
9.1.1 Terminal Value Distribution in a Bounded α-system

Suppose an α-perceptron has upper and lower limits L and ℓ for the values v_ir. Suppose a particular connection, c_ir, receives a reinforcement of +1 with probability p, -1 with probability q, and 0 with probability 1-p-q. If all stimuli are equiprobable, and the perceptron is trained by an S-controlled procedure, this would correspond to a connection from an A-unit with bias ratio p/q (see Definition, Page 77). It is assumed in the following analysis that the reinforcements occurring at different times are statistically independent. For convenience, L and ℓ are taken to be integers. Then the value, v_ir, may assume any one of L-ℓ+1 distinct states (ℓ, ℓ+1, ..., L). Clearly, if unit a_i responds more often to stimuli of the positive class than to stimuli of the negative class, v_ir will tend to grow in a positive direction. Eventually it will arrive at the limit L. At this point, a run of "negative" stimuli may bring it down again, but it can never exceed L. If the unit has a negative bias, v_ir will similarly tend to remain in the neighborhood of the lower limit, ℓ. The problem is to find the terminal probability distribution (if one exists) for the value v_ir, as the duration T of the training sequence goes to infinity.


In the following analysis, it will first be assumed that a stable terminal probability distribution for v_ir exists, which will not be altered by the addition of more stimuli to the training sequence. On the basis of this assumption, an equation for the distribution can be found. It will then be proven by induction that the proposed distribution is, in fact, a stable probability distribution.
Let Π(x) = probability that v_ir = x, in the terminal probability distribution. Let Π(ℓ) = c. This will be equal to the probability of v_ir arriving at ℓ from above, plus the probability that v_ir remains in state ℓ if it is already there. Thus,

$$\Pi(\ell) = q\big[\Pi(\ell) + \Pi(\ell+1)\big] + (1-p-q)\,\Pi(\ell)$$

Hence

$$\Pi(\ell+1) = \frac{p}{q}\,\Pi(\ell) = \frac{p}{q}\,c \tag{9.1}$$

For any integer 2 ≤ i ≤ L-ℓ,

$$\Pi(\ell+i-1) = q\,\Pi(\ell+i) + p\,\Pi(\ell+i-2) + (1-p-q)\,\Pi(\ell+i-1)$$

Hence,

$$\Pi(\ell+i) = \frac{p+q}{q}\,\Pi(\ell+i-1) - \frac{p}{q}\,\Pi(\ell+i-2) \tag{9.2}$$

Thus, all values of Π(x) can be computed if the probability c of v_ir being at the lower limit is known. Since the sum of Π for all possible values of v_ir must be 1, the value of c can be obtained from the equation:

$$\sum_{i=0}^{L-\ell} \Pi(\ell+i) = 1 \tag{9.3}$$
For the distribution to be stable, it is sufficient that the probability of v_ir being at its upper limit satisfies the equation

$$\Pi(L) = p\,\Pi(L-1) + (1-q)\,\Pi(L) \tag{9.4}$$

By induction on i, it will be shown that

$$\Pi(\ell+i) = p\,\Pi(\ell+i-1) + (1-q)\,\Pi(\ell+i), \quad \text{or equivalently} \quad \Pi(\ell+i) = \frac{p}{q}\,\Pi(\ell+i-1) \tag{9.5}$$

for 1 ≤ i ≤ L-ℓ. (9.4) is only a special case of (9.5).

To begin with, for i = 1, we have Π(ℓ) = c and from (9.1), Π(ℓ+1) = (p/q)c. This clearly agrees with (9.5). Now assume (9.5) is true for i = r (1 ≤ r ≤ L-ℓ-1). That is,

$$\Pi(\ell+r) = \frac{p}{q}\,\Pi(\ell+r-1)$$

But by (9.2), letting i = r+1, we then obtain

$$\Pi(\ell+r+1) = \frac{p+q}{q}\,\Pi(\ell+r) - \frac{p}{q}\,\Pi(\ell+r-1) = \frac{p+q}{q}\,\Pi(\ell+r) - \Pi(\ell+r) = \frac{p}{q}\,\Pi(\ell+r)$$

Thus, having assumed (9.5) to be true for i = r, we find that it is also true for i = r+1; consequently it is true for all i, and (9.4), as the special case i = L-ℓ, must be true. From (9.5) it is also clear that the quantities Π will all be non-negative, so that the function Π(x) meets the requirements for a probability distribution.
Equation (9.5) can be used to compute Π(x) by assuming an arbitrary value for c, and then normalizing the distribution as in (9.3). The equation can be simplified by taking the lower limit, ℓ, equal to zero, and c = 1 for the unnormalized distribution. Then Π(x) = (p/q)^x prior to normalization. For the normalized distribution (with p ≠ q),

$$c = \Big[ \sum_{x=0}^{L} (p/q)^x \Big]^{-1} = \frac{1 - p/q}{1 - (p/q)^{L+1}}$$

This completes the proof of the following theorem:

THEOREM: In a bounded α-perceptron, with S-controlled reinforcement, the probability distribution Π(v) (for the value of a particular connection) approaches a stable terminal distribution of the form Π(v) = c (p/q)^(v-ℓ), where c is a normalization constant equal to the reciprocal of the sum of (p/q)^i over i = 0, 1, ..., L-ℓ.


Figure 31 shows the probability distribution for $v_{ir}$ for several values of $p/q$ and for 40 increments between the upper and lower limits. (The distributions are symmetric for equivalent values of $q/p$, with upper and lower limits reversed.) Note that with even a slight bias ($p/q \ne 1$) there is a low probability that $v_{ir}$ will have a sign opposite to the bias. For $p/q = .9$, for example (and taking $\ell = -20$, $L = +20$, as in the figure) the probability of a positive $v_{ir}$ in the terminal distribution is only .097. If the range were half as great (20 increments instead of 40) the probability of positive $v_{ir}$ for the same conditions would be increased to .2295.
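The terminal distribution of the theorem is easy to evaluate numerically. The following is a minimal sketch, assuming only the geometric form $\Pi(v) \propto (p/q)^{v-\ell}$ stated in the theorem; the function names are illustrative and do not come from the text.

```python
def terminal_distribution(ratio, lower, upper):
    """Return {v: Pi(v)} with Pi(v) proportional to ratio**(v - lower).

    `ratio` stands for p/q; `lower` and `upper` are the limits l and L.
    """
    weights = {v: ratio ** (v - lower) for v in range(lower, upper + 1)}
    total = sum(weights.values())  # equals 1/c in the theorem
    return {v: w / total for v, w in weights.items()}


def prob_positive(ratio, lower, upper):
    """Probability that the connection value is strictly positive."""
    dist = terminal_distribution(ratio, lower, upper)
    return sum(p for v, p in dist.items() if v > 0)


# The two cases discussed in the text: p/q = 0.9 with 40 and 20 increments.
p40 = prob_positive(0.9, -20, 20)
p20 = prob_positive(0.9, -10, 10)
print(round(p40, 4), round(p20, 4))
```

The printed values can be compared directly with the probabilities quoted for Figure 31.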
The frequency of possible ratios $p/q$ for A-units responding to horizontal and vertical bars can be determined from Table 1. From this, it is clear that the majority of units have a pronounced bias towards one class or the other, so that one might expect to find the majority of active connections having values in the neighborhood of the appropriate limit, $L$ or $\ell$. This heuristic argument supports the conjecture that the bounded system should still be capable of learning discrimination tasks in S-controlled experiments, even though the system tends to "saturate", with all values in the neighborhood of the upper or lower limit. The quantitative performance of such systems will be taken up in Section 9.1.3.


9.1.2 Terminal Value Distribution in Bounded γ-systems

In a bounded γ-perceptron, the analysis of the terminal distribution for $v_{ir}$ is complicated by two considerations. First, there are at least four possible values of $\Delta v$, namely $1-Q_i$, $-1+Q_i$, $-Q_i$, and $+Q_i$, each with its own probability. If $Q_i$ is not equal for all stimuli, the number of possible values for $\Delta v$ is increased in proportion to the number of different values for $Q_i$. The second consideration is that the conservation rule, which requires the sum of all values to remain constant, makes the admissible increment for one connection dependent on how many of the other connections are currently free to move. For example, if all of the "active" connections have values equal to $L$, the expected decrement, $-Q_i$, for the inactive connections due to the application of a positive $\Delta v$ cannot occur.




Due to these complications, an analysis for a true γ-system has never been carried out. An analysis has been completed by Joseph for a γ'-system with monopolar reinforcement (i.e., reinforcement is applied only for stimuli of the positive class, and $\eta = 0$ for stimuli of the negative class). In this case there are only two non-zero changes which might occur, $1-Q_i$ for active connections and $-Q_i$ for inactive connections, and the reinforcement of a given connection does not depend on the state of any other parallel connections, as it does in the γ-system. The analysis is a somewhat more complicated form of that presented in the preceding section (due to the inequality of positive and negative changes in $v_{ir}$). Since the equations are of limited interest aside from the specific model considered, they will not be repeated here, but they can be found, together with typical distribution curves, in Ref. 41.


9.1.3 Performance of Bounded α-systems in S-controlled Experiments

From the preceding analysis, it is clear that with a large number of increments between the upper and lower limits of $v_{ir}$, the value will ultimately tend to remain in the neighborhood of the upper or lower bound, depending upon the bias ratio of $a_i$. In the following analysis, the problem is simplified by assuming that the limits are actually trapping, so that once a connection has arrived at value $L$ or $\ell$, it remains there permanently, regardless of future reinforcement.


Consider a basic training sequence of $m$ stimuli, $S_1, \ldots, S_m$, which is then repeated a sufficient number of times to "saturate" the system, i.e., to drive all biased values to their limits. If the value of a connection is $v$ after the first $m$ stimuli, then after $r$ repetitions of the training sequence, the value will be

$$\begin{cases} \min(L,\, rv) & \text{if } v > 0 \\ \max(\ell,\, rv) & \text{if } v < 0 \\ 0 & \text{if } v = 0 \end{cases}$$

for a bounded α-system. An unbounded α-system will have the same performance after $r$ repetitions of the training sequence as after a single repetition. The following analysis compares the performance of the "saturated" bounded α-system with that of the unbounded α-system at the end of the training sequence. The analysis will be accurate for the assumption of a large range between $L$ and $\ell$, so that after the first $m$ stimuli none of the values have reached their limits.
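The saturation rule for repeated training sequences can be written as a small function. This is only a sketch of the rule as described, with trapping limits and a value that grows by the same amount on every repetition; the function and argument names are assumptions made for illustration.

```python
def value_after_repetitions(v, r, lower, upper):
    """Value of a connection after r repetitions of the training sequence.

    v     : value reached after the first pass through the m stimuli
    lower : lower (negative) limit, written l in the text
    upper : upper (positive) limit, written L in the text
    """
    if v > 0:
        return min(upper, r * v)   # grows until trapped at L
    if v < 0:
        return max(lower, r * v)   # falls until trapped at l
    return 0                       # an unbiased connection never moves


# A positively biased connection saturates at L = 20 after enough repetitions.
print(value_after_repetitions(3, 2, -20, 20))
print(value_after_repetitions(3, 10, -20, 20))
print(value_after_repetitions(-3, 10, -20, 20))
```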


Let $P_x$ be the probability that $R = +1$ for test stimulus $S_x$, for the unbounded α-system, and $P_x'$ be the corresponding probability for the bounded α-system. Then the conditional probability $(P_x' \mid P_x)$ gives the performance of the bounded system as a function of the performance of the unbounded system (which is known from Chapter 7).


Suppose $N_a^*$ A-units are activated by the test stimulus, $S_x$. Then for the unbounded system, $(P_x \mid N_a^*) = \Phi(Z)$, where $\Phi$ is the cumulative distribution function defined by equation (7.7) and

$$Z = \frac{\sqrt{N_a^*}\,E(v_{ir})}{\sigma(v_{ir})}$$

where $E(v_{ir})$ = expected value of a connection activated by $S_x$, and $\sigma(v_{ir})$ = standard deviation of such a connection. The bounded α-system, on the other hand, will give response +1 if the proportion of the $N_a^*$ active connections having value $L$ is greater than $-\ell/(L-\ell)$. If $\ell = -L$, then this reduces to a requirement that the number of active connections having value $L$ should be greater than the number having value $\ell$. The connections having value 0 may be ignored. As with the unbounded system, it is assumed that after the first $m$ stimuli, $v_{ir}$ is normally distributed with expected value $E(v_{ir})$ and variance $\sigma^2(v_{ir})$. This assumption is reasonable if the range of $v_{ir}$, $(L-\ell)$, is greater than $2m$ and $m$ is fairly large. If the range of $v_{ir}$ is less than $2m$, the analysis can be considered only an approximation, which becomes increasingly poor as the range diminishes.


Under these conditions, in the bounded system, the probability that the terminal value of a connection is $L$ is equal to the probability that $v_{ir}$ is positive after the first $m$ stimuli. This is equal to $\Phi\bigl(E(v_{ir})/\sigma(v_{ir})\bigr)$. Since $\Phi$ is a cumulative probability distribution it is a one-to-one function from its domain to its range, and is therefore invertible. Thus, given $P_x$ and $N_a^*$, the probability $\lambda_x$ that a connection activated by $S_x$ goes to value $L$ will be:

$$(\lambda_x \mid P_x, N_a^*) = \Phi\!\left(\frac{\Phi^{-1}(P_x)}{\sqrt{N_a^*}}\right) \tag{9.6}$$

and this yields

$$(P_x' \mid P_x, N_a^*) = \sum_{y=r}^{N_a^*} \binom{N_a^*}{y} \lambda_x^{\,y} (1-\lambda_x)^{N_a^*-y} \tag{9.7}$$

where $r = \left\lceil N_a^* \dfrac{|\ell|}{L+|\ell|} \right\rceil$,* the notation $\lceil n \rceil$ indicating the least integer greater than or equal to $n$. To obtain $(P_x' \mid P_x)$, the expectation of (9.7) with respect to $N_a^*$ is required. For reasonably large values of $N_a$, $(P_x' \mid P_x) \approx (P_x' \mid P_x, E(N_a^*))$. Substituting $Q_x N_a$ for $E(N_a^*)$ this finally yields:

$$(P_x' \mid P_x) \approx \sum_{y=r}^{Q_x N_a} \binom{Q_x N_a}{y} \lambda_x^{\,y} (1-\lambda_x)^{Q_x N_a - y} \tag{9.8}$$

where $\lambda_x = \Phi\!\left(\dfrac{\Phi^{-1}(P_x)}{\sqrt{Q_x N_a}}\right)$ and $r = \left\lceil Q_x N_a \dfrac{|\ell|}{L+|\ell|} \right\rceil$.

Figure 32 CONDITIONAL ERROR PROBABILITY FOR BOUNDED α-SYSTEM vs. ERROR PROBABILITY FOR UNBOUNDED SYSTEM. ($\ell = -L$)


In Figure 32, the conditional probability of error in a bounded α-perceptron is shown as a function of the error probability $(1-P_x)$ for the unbounded system, for several values of $N_a^*$, with $\dfrac{|\ell|}{L+|\ell|}$ taken to be 1/2. Curves of this function for cases where upper and lower limits are not symmetric can be found in Joseph, Ref. 41 (Figures 10-14).
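Equations (9.6) and (9.7) can be evaluated directly with the Python standard library. The sketch below assumes that $\Phi$ in equation (7.7) is the standard normal cumulative distribution function, which is consistent with the normality assumption used in this section; the function names are illustrative.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist

_PHI = NormalDist()  # standard normal; assumed to be the Phi of equation (7.7)


def connection_saturation_prob(p_x, n_active):
    """Equation (9.6): probability that one active connection ends at L."""
    return _PHI.cdf(_PHI.inv_cdf(p_x) / sqrt(n_active))


def bounded_response_prob(p_x, n_active, lower, upper):
    """Equation (9.7): P(R = +1) for the saturated bounded system.

    p_x must lie strictly between 0 and 1 (inv_cdf is undefined otherwise).
    """
    lam = connection_saturation_prob(p_x, n_active)
    r = ceil(n_active * abs(lower) / (upper + abs(lower)))
    return sum(
        comb(n_active, y) * lam**y * (1 - lam) ** (n_active - y)
        for y in range(r, n_active + 1)
    )


# Symmetric limits (l = -L) and 25 active A-units.
for p_x in (0.6, 0.9, 0.99):
    print(p_x, round(bounded_response_prob(p_x, 25, -20, 20), 4))
```

The sum starts at $y = r$, following (9.7) as printed, so a tie between the two limits is counted as a +1 response when $N_a^* |\ell| / (L+|\ell|)$ is an integer.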
9.1.4 Performance of Bounded γ-systems in S-controlled Experiments

The analysis in the preceding section, and the curves shown in Fig. 32, can be applied without modification to bounded γ'-perceptrons. The true γ-system, however, may perform somewhat better than the γ'-system, since not all values can "saturate" independently. If more than half of the connections have a positive bias, for example, not all of the positively biased connections can go to the limit $L$, since this would require that the remaining connections take on values less than $\ell$, in order to satisfy the conservation rule. In the γ-system, therefore, we would expect a greater number of connections to remain at intermediate values, rather than going to the limits, and this should result in a "compromise" between the performance of an unbounded and a bounded value system. An exact analysis of the γ-system has not been carried out.

* It is assumed here that $L > 0$, $\ell < 0$.


9.2 Analysis of Systems with Decaying Values 


The bounded value systems have two disadvantages relative to the "ideal" unbounded systems. First, they permit a smaller number of memory states, and second, in S-controlled experiments they tend to arrive at a saturation condition in which their performance is actually poorer than that obtained during the transient learning phase; that is, their performance curve first increases to a maximum, and then declines to a terminal asymptote as the system saturates. The first disadvantage is not serious, if the range of $v_{ir}$ is reasonably large. The second may be more critical, since it means that units with a low "utility" for a given discrimination are pulling as much weight in the saturated system as units with high utility (as measured by their bias ratios). In the cross-coupled perceptrons considered in Part III, this latter consideration is more salient than in elementary perceptrons.

An alternative value-limiting mechanism, which is also of interest due to its apparent biological plausibility, is obtained by allowing the values to decay exponentially towards a resting state (generally taken to be zero). This mechanism is relatively free from the difficulties encountered in the bounded value system. In this model, $v_{ir}$ will continue to grow in the direction determined by the bias ratio of $a_i$, until the expected rate of reinforcement is exactly balanced by the rate of decay. At this point a dynamic equilibrium will occur, with $v_{ir}$ tending to fluctuate about the equilibrium level. This means that connections which are frequently reinforced, in a consistent direction, will attain higher values, in the limit, than infrequently reinforced connections, or connections with low bias.


Consider an α-system with decaying values. Let the decay rate be equal to $\delta$ $(\delta \ll 1)$. Let the probabilities of positive and negative increments to $v_{ir}$ be $p$ and $q$, as in the analysis of bounded α-systems. As long as $\delta$ is small, $v_{ir}$ will tend to approach an expected asymptotic value equal to $(p-q)/\delta$. At this point, the expected rate of gain, per unit time, is $p-q$, and the expected rate of loss is $\delta v_{ir} = p-q$. If the value of $\delta$ is very small, and the relaxation time correspondingly long relative to the expected recurrence rate of stimuli from the environment, this system should approach as a limit the same performance as the unbounded α-system, where $v_{ir}$ tends to grow in proportion to $p-q$. If $\delta$ is somewhat larger, however, we find that the most recent stimuli in the training sequence will have the most pronounced effect, progressively earlier stimuli exerting a progressively diminishing effect due to the decay of $v_{ir}$. Such a perceptron tends to forget its remote experience in favor of more recent experience.
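The approach to the equilibrium value $(p-q)/\delta$ can be illustrated by iterating the expected value of $v_{ir}$. The sketch below assumes a discrete-time update in which each step adds the expected gain $p-q$ and removes the fraction $\delta$ of the current value; this discretization and the function name are assumptions made for illustration, not taken from the text.

```python
def expected_value_trajectory(p, q, delta, steps, v0=0.0):
    """Iterate E[v] <- E[v] + (p - q) - delta * E[v], starting from v0."""
    v = v0
    trajectory = [v]
    for _ in range(steps):
        v = v + (p - q) - delta * v   # expected gain minus expected decay loss
        trajectory.append(v)
    return trajectory


p, q, delta = 0.6, 0.3, 0.01
trajectory = expected_value_trajectory(p, q, delta, steps=2000)
equilibrium = (p - q) / delta         # asymptotic value given in the text
print(trajectory[-1], equilibrium)
```

With a larger `delta` the trajectory settles sooner and at a lower level, which corresponds to the shorter memory described above.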


The dependence of these systems on the sequence as well as the identity of training stimuli makes them difficult to analyze when the relaxation time, or "half-life" of $v_{ir}$, is on the same order as, or shorter than, the training sequence. If $\delta$ is sufficiently small, performance can be assumed identical to the unbounded system. An absolute bound on the maximum attainable magnitude of $v_{ir}$ for a decaying value perceptron will be $1/\delta$, corresponding to a situation in which $v_{ir}$ is reinforced continuously in the same direction.


9.3 Experiments with Decaying Value Perceptrons 


9.3.1 S-controlled Discrimination Experiments 


The essential features of S-controlled discrimination experiments with decaying value perceptrons have already been noted in the preceding section. If the decay rate is small, the decaying value system approaches the performance of the corresponding "ideal" or unbounded system. If the decay rate is relatively large, forgetting occurs, which is greatest for temporally remote events and negligible for recent events in the training sequence.


9.3.2 Error-correction Experiments 


In discrimination experiments with error corrective reinforcement, a more complicated situation exists than in the case of S-controlled experiments. In the error correction system, once the perceptron has learned a task, reinforcement ceases, and the values of a decaying system would be expected to decay back towards zero. In a perfectly noise-free system, the values would all decay in proportion to their magnitudes, however, and consequently their ratios would never change as long as no further reinforcement was applied. Thus once perfect performance is achieved, it will not be lost as long as the values remain above the noise-level of the system, despite the decay effect. This also means that if a "run" of correct responses occurs during training, the ratios of $v_{ir}$ for different connections will be unaltered, so that the next error to occur will be no different in the decaying value model than in the unbounded model. Consequently, the application of reinforcement just sufficient to correct this error will bring the ratios of the values to precisely the state that they would have in the unbounded model, and ability to achieve a solution to a classification problem should be unaffected, in principle. In actuality, however, the continuously decaying values clearly present a problem, since any physical system will ultimately forget, when the values become small enough to be undetectable.
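The argument that proportional decay leaves the responses unchanged can be checked on a toy example. The sketch below uses hypothetical values and activity patterns; the R-unit is taken to respond with the sign of the summed values of the active connections, and all names are illustrative.

```python
def response(values, active):
    """Sign of the summed values of the active connections (+1, -1, or 0)."""
    total = sum(v for v, a in zip(values, active) if a)
    return (total > 0) - (total < 0)


def decay(values, delta, steps):
    """Noise-free exponential decay: every value shrinks by the same factor."""
    factor = (1.0 - delta) ** steps
    return [v * factor for v in values]


values = [3.0, -1.0, 0.5, -2.0]          # hypothetical connection values
stimuli = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]

before = [response(values, s) for s in stimuli]
after = [response(decay(values, 0.01, 500), s) for s in stimuli]
print(before, after)
```

Because every value is multiplied by the same positive factor, each sum keeps its sign, which is the sense in which learned performance is not lost until the values fall below the noise level.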


A variation of the decaying value model is capable of eliminating the problem caused by the diminution of the values in an unreinforced system. If $v_{ir}$ is held constant so long as no reinforcement signal is received from the reinforcement control system, but decays exponentially in the presence of such a signal, the learning ability of the perceptron will still be unaltered (by the same argument as above), and no change will occur once the task has been properly learned. This means that the increment to the value of $v_{ir}$ at time $t$ will be


$$\Delta v_{ir}(t) = \left[a_i^*(t) - \delta\, v_{ir}(t)\right] \eta(t)$$

where $\eta(t)$ may be $+1$, $-1$, or $0$.


It should be noted that in the error-correction procedure, the loss of temporally remote experience with large values of $\delta$ does not occur, in an ideally functioning (noise-free) system. Unlike the S-controlled system, where the magnitude of new reinforcements remains unchanged as the values decay, the error correction procedure will require smaller or less frequent increments in order to correct an error, and earlier experience tends to be retained about as well as in the unbounded, or non-decaying system. A loss of early experience does occur, in such systems, but it is due to "writing over" earlier memory traces with more recent reinforcement, rather than to a passive decay, as in the case of the S-controlled system. This observation would seem to indicate a closer correspondence of the error-corrective system with what is known of forgetting in biological systems.

The mean performance curves for eight simulated perceptrons with $\delta = 0$, $\delta = .001$, and $\delta = .01$ are shown in Fig. 33. Note that for these actual systems, there is a progressive deterioration of performance as the decay rate is increased.
9.3.3 R-controlled Experiments

The most interesting experimental results obtained to date with decaying value perceptrons deal with the performance of decaying γ-systems in R-controlled experiments. Experiment 9 has been studied most extensively, by means of simulation experiments representing a very large, or infinite $N_a$, perceptron. Unlike the previous experiments (discussed in Section 8.3) monopolar reinforcement was employed, i.e., the perceptron was reinforced positively for $r^* = +1$, and was not reinforced at all for $r^* = -1$. The system was further modified by assuming a slight negative quantity to be added to $\Delta v_{ir}(t)$ for all $t$; that is, an invariant negative reinforcement component was added uniformly to all connections, regardless of what stimulus occurred, and regardless of the activity state of the connection. In the absence of any other components, this would cause a progressive downward drift of all $v_{ir}$ until they achieved an equilibrium with the decay rate. It was assumed that this negative component was sufficient to add a quantity equal to -0.0001 to the set of connections activated by a single stimulus. Thus, apart from the decay, the change in values for each reinforcement could be expressed by the equation:

$$g_{ij} = Q_{ij} - Q_i Q_j - 0.0001$$


The effect of the fixed negative component in these experiments is to create a negative generalization from the first stimulus to occur (say a horizontal bar) to all members of the opposite class (vertical bars) in place of the zero generalization which would otherwise occur with a γ-system. The result is that after having seen a single stimulus which activates a positive response, all members of the opposite class are thenceforth permanently classified in the negative class, as no further events can occur which will make one of them positive. If the initial stimulus is a horizontal bar, then, with monopolar reinforcement, no vertical bar will be reinforced, since all vertical bars evoke a -1 response. The next stimulus which can possibly be reinforced is, in fact, another horizontal bar which happens to be close enough to the previous one to have received positive generalization from the first reinforcement, i.e., the first or second neighbor on either side. The result is a gradual growth of the positive stimulus set, by accretion of near neighbors which have received positive generalization from those bars already classified as "positive". Thus, having started out by randomly placing a horizontal bar in the positive class, the system has no choice but to include only horizontal bars in the positive class, and, with sufficient time, all horizontal bars are so classified.
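The role of the fixed negative component can be seen by evaluating the generalization coefficient for two kinds of stimulus pair. The sketch below uses the expression $Q_{ij} - Q_i Q_j - 0.0001$ from the preceding discussion; the numerical values of $Q_i$, $Q_j$, and $Q_{ij}$ are hypothetical and chosen only to illustrate the sign of the result.

```python
def generalization_coefficient(q_ij, q_i, q_j, offset=0.0001):
    """g_ij for the modified system: Q_ij - Q_i * Q_j - offset."""
    return q_ij - q_i * q_j - offset


# Pair with no net generalization in the unmodified system (Q_ij = Q_i * Q_j),
# e.g. a horizontal and a vertical bar: the offset makes g_ij slightly negative.
g_opposite = generalization_coefficient(q_ij=0.01, q_i=0.1, q_j=0.1)

# Pair with strong overlap, e.g. neighboring horizontal bars (hypothetical Q_ij).
g_neighbor = generalization_coefficient(q_ij=0.08, q_i=0.1, q_j=0.1)

print(g_opposite, g_neighbor)
```

A pair that would otherwise generalize to zero is pushed slightly negative, while a strongly overlapping pair remains positive, which is what drives the growth of the positive class by near neighbors.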


While this phenomenon occurs even if the decay rate is zero, it is markedly accelerated by a non-zero decay rate. With $\delta = 0$, the perceptron shows a high degree of "rigidity" in its early classification, in which some horizontal bars are positive, and the remainder still negative (as in Section 8.3). This is due to the continually increasing magnitude of the negative values evoked by the "incorrectly" classified stimuli, which must be overcome in order to change their classification. Thus, as time progresses, it becomes harder and harder to switch each additional horizontal bar into the positive class, since an increasingly large number of "marginal" positive stimuli must be reinforced in order to obtain the required amount of positive generalization. Moreover, as the positive class expands, the stimuli which are centrally located within the "positive band" all contribute further negative generalization to the remaining stimuli, rather than helping to make them positive. These combined effects lead to a convex, negatively accelerating learning curve, as illustrated in Figure 33. The addition of a non-zero decay rate limits the negative value which must be overcome in order to change the classification of an "incorrect" stimulus, and thus makes the system more flexible.


If the decay rate is increased progressively, it is found that there is an optimum at about $\delta = 0.01$. If the decay rate is increased further, instability occurs, due to the loss of stimuli which were previously classified correctly, but whose positive values have decayed to such an extent as to be overcome by negative generalization from other stimuli. These effects are shown both in the learning curves of Fig. 34(a) and in Fig. 34(b), which shows the expected learning time to perfect performance (i.e., perfect dichotomization of horizontal and vertical bars), obtained from a sample of 10 runs.


It might seem, from these results, that perceptrons organized in the manner indicated could be expected to form "meaningful" classifications of stimuli, on some basis other than retinal position. Unfortunately, the results, while illuminating, are highly restricted in generality. The proposed dynamics are too contrived to be biologically plausible, and it is found that in any environment in which classes of stimuli to be differentiated permit positive generalization between members of different classes (a much more usual situation) the mechanism which yields good separation in the above example breaks down. If $g_{ij}$ between a single horizontal bar and any of the vertical bars were positive, for example, the spread of generalization would not stop with the members of the horizontal class, in the above case, but would invade the opposite class as well. If, instead of 4 by 20 horizontal and vertical bars, the perceptron is confronted with an environment consisting of the twenty horizontal bars and a set of twenty pairs of parallel 2 by 20 horizontal bars, separated by a space of 3 units on the retina, the perceptron will not spontaneously learn to distinguish single bars from double bars (although this task presents no difficulty in an S-controlled experiment).


Another shortcoming of the spontaneous organization phenomenon which has been demonstrated here is the basically unbiological character of the learning curves. It has already been noted that these curves are convex, or decelerating. A human subject, or even an animal subject, confronted with the problem of distinguishing horizontal from vertical bars might make many mistakes initially, but would soon accelerate his learning as he began to generalize to new stimuli. If he had a hundred bars, in different retinal positions, to classify, the hundredth bar would certainly not present the almost insurmountable obstacle that it represents for the elementary perceptron. Thus it is clear that the most sophisticated generalization phenomena which have yet been found in elementary perceptrons are still far short of what one should expect from an adequate brain model, if biological standards are employed. This problem will be re-examined at greater length in Part III, where it will be seen that multi-layer and cross-coupled perceptrons perform such tasks in a much more suitable fashion than those systems which have been considered thus far.

This completes the presentation of elementary perceptrons. In the following chapters, some other types of minimal (S-A-R) perceptrons will be considered, but it will be seen that none of these have capabilities for generalization appreciably beyond those discovered in the elementary systems.
10. SIMPLE PERCEPTRONS WITH NON-SIMPLE A AND R-UNITS


In Chapter 4, a simple perceptron was defined as one which satisfies the following five conditions:

1. There is a single R-unit, with a connection from every A-unit.
2. The perceptron is series coupled, with an S-A-R topology.
3. The values of all S-A connections are invariant.
4. Transmission times of all connections are equal ($\tau$ generally taken as 0).
5. All signals generated by S, A, and R-units are functions of the algebraic sum of input signals arriving simultaneously at the unit.


In the preceding chapters, we have considered elementary perceptrons, which are characterized by the additional constraints that all A and R-units are "simple" units, and that the transmission function of the connection $c_{ij}$ takes the form: $c_{ij}^*(t) = u_i^*(t-\tau)\, v_{ij}(t)$. A simple A-unit is a signal generating unit which emits an output signal $a_i^* = +1$ if the algebraic sum of the input signals, $\alpha_i$, is equal to or greater than the threshold $\theta$, and 0 otherwise. A simple R-unit emits a +1 signal if the sum of its input signals is strictly positive, and -1 if the sum of its inputs is strictly negative. In this chapter, we shall consider the properties of simple perceptrons in which these constraints are dropped. This will include a brief consideration of linear networks in which all signals are transmitted in proportion to their value; the properties of perceptrons with linear R-units but non-linear A-units will then be considered, and finally the question of optimum transmission functions will be discussed. In later chapters, the remaining constraints of simple perceptrons will be modified, and a number of non-simple systems will be analyzed.
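The two "simple" unit types defined above can be written as short functions. This is a minimal sketch of the definitions as stated; the function names are illustrative, and returning `None` for a zero input sum to the R-unit is an assumption, since the definition only covers strictly positive and strictly negative sums.

```python
def simple_a_unit(alpha, theta):
    """Output +1 if the summed input alpha reaches the threshold theta, else 0."""
    return 1 if alpha >= theta else 0


def simple_r_unit(total):
    """Output +1 for a strictly positive sum, -1 for a strictly negative sum."""
    if total > 0:
        return 1
    if total < 0:
        return -1
    return None  # zero sum: not covered by the definition (assumption)


print(simple_a_unit(2.0, 2.0), simple_a_unit(1.9, 2.0))
print(simple_r_unit(0.3), simple_r_unit(-0.3), simple_r_unit(0.0))
```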


10.1 Completely Linear Perceptrons 


A completely linear perceptron is one in which all signal functions and transmission functions are linear, i.e., the output of unit $u_i$ is of the form $u_i^* = k_i \alpha_i$, and the signal transmitted by a connection $c_{ij}$ is of the form $c_{ij}^* = u_i^* v_{ij}$. We will consider linear perceptrons in environments such that the inputs to an S-unit are either 1 or 0 (so that the conclusions apply equally well to perceptrons which are linear everywhere except in the S-units). By analogy to Section 5.4, we define the bias ratio of an S-unit as $n^+/n^-$, where $n^+$ is the number of positive stimuli, and $n^-$ the number of negative stimuli which activate the S-unit. For such systems, the following theorem holds:

THEOREM 1: Given a completely linear perceptron, a stimulus world, $W$, and a classification $C(W)$ such that the bias ratio of every S-unit is equal (and non-zero), no solution to $C(W)$ can exist.


PROOF: Let $x^+$ = index of any stimulus in positive class ($S_{x^+}$),
$x^-$ = index of any stimulus in negative class ($S_{x^-}$),
$k$ = index of the $k$th sensory unit,
$c_{ki}^*(x)$ = signal transmitted from the $k$th sensory unit to the $i$th A-unit in response to stimulus $S_x$.

When stimulus $S_x$ occurs, unit $a_i$ transmits a signal equal to $\alpha_i(x)\, v_{ir}$ to the R-unit, where

$$\alpha_i(x) = \sum_k c_{ki}^*(x)$$

The total signal, $u_x$, received by the R-unit from $S_x$ is therefore:

$$u_x = \sum_i \alpha_i(x)\, v_{ir} = \sum_i \sum_k c_{ki}^*(x)\, v_{ir}$$

Since every signal $u_x$ must agree in sign with the classification of $S_x$ for a solution to exist, we require that the following inequalities be satisfied:

$$\sum_{x^+} \sum_i \sum_k c_{ki}^*(x^+)\, v_{ir} > 0 \tag{10.1}$$

$$\sum_{x^-} \sum_i \sum_k c_{ki}^*(x^-)\, v_{ir} < 0 \tag{10.2}$$

But it has been stipulated that the bias ratio of each S-point is equal to a constant, $\rho > 0$. This means that, for any $i$ and $k$,

$$\sum_{x^+} c_{ki}^*(x^+) = \rho \sum_{x^-} c_{ki}^*(x^-) \qquad (\rho > 0)$$

or, summing over S-units,

$$\sum_{x^+} \alpha_i(x^+) = \rho \sum_{x^-} \alpha_i(x^-)$$

Substituting in the expressions (10.1) and (10.2) we get the contradiction

$$0 < \sum_{x^+} \sum_i \alpha_i(x^+)\, v_{ir} = \rho \sum_{x^-} \sum_i \alpha_i(x^-)\, v_{ir} < 0$$

which proves that a solution cannot exist.


This means that if two stimulus patterns are placed in all possible positions on a retina, the resulting classes of stimuli cannot be correctly discriminated by a linear perceptron. As a consequence, such systems are relatively uninteresting, even though they may successfully discriminate a moderate number of patterns which are restricted to limited positions on the retina. In all systems considered from here on, there will be at least one set of non-linear components subsequent to the S-units in the perceptron network.
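The consequence of Theorem 1 for patterns placed in all positions can be checked numerically on a toy retina. The sketch below is an illustration under stated assumptions: a hypothetical 6-point ring-shaped retina, two hypothetical patterns shifted to every position, and the observation that a completely linear network collapses to one effective weight per S-unit. None of these specifics come from the text.

```python
import random


def shifted(pattern, k):
    """Cyclically shift a 0/1 pattern by k positions on a ring retina."""
    n = len(pattern)
    return [pattern[(j - k) % n] for j in range(n)]


def signal(stimulus, weights):
    """Total linear signal reaching the R-unit for one stimulus."""
    return sum(s * w for s, w in zip(stimulus, weights))


RETINA = 6
positive = [shifted([1, 1, 0, 0, 0, 0], k) for k in range(RETINA)]
negative = [shifted([1, 0, 1, 0, 0, 0], k) for k in range(RETINA)]
# Every S-unit is activated by 2 positive and 2 negative stimuli (bias ratio 1).

rng = random.Random(0)
gaps, solved = [], []
for _ in range(1000):
    w = [rng.uniform(-1.0, 1.0) for _ in range(RETINA)]  # effective S-unit weights
    pos_total = sum(signal(s, w) for s in positive)
    neg_total = sum(signal(s, w) for s in negative)
    gaps.append(abs(pos_total - neg_total))
    solved.append(all(signal(s, w) > 0 for s in positive)
                  and all(signal(s, w) < 0 for s in negative))

print(max(gaps), any(solved))
```

For every random weight vector the class totals coincide, so the positive signals cannot all be positive while the negative signals are all negative.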


10.2 Perceptrons with Continuous R-units 


The next type of perceptron to be considered has simple A-units, but continuous R-units, such that the response $r_i^* = f(u_i)$, with $f$ an arbitrary monotonic function of $u_i$. This includes the case of linear R-units, where $f(u_i) = c\,u_i$. An important theorem which is analogous to Theorem 4 of Chapter 5 deals with the ability of such systems to learn arbitrary response functions (Definition 27, Chapter 4) under the error correction procedure. A response function assigns an arbitrary output signal (rather than just $\pm 1$) to every stimulus in $W$. We first prove the following Lemma:

LEMMA 1: Given a symmetric positive definite or positive semidefinite matrix, $H$, and any vector $z$, then $(z, Hz) = 0$ only if $Hz = 0$.

PROOF: Since $H$ is positive definite or semidefinite, there exists a matrix $B$ such that $H = B'B$.

$$0 = (z, Hz) = (Bz, Bz) \;\Rightarrow\; Bz = 0 \;\Rightarrow\; 0 = B'Bz = Hz$$
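Lemma 1 can be checked on a small positive semidefinite matrix. The sketch below builds $H = B'B$ from a hypothetical one-row matrix $B$ and evaluates the quadratic form for a vector in the null space; the helper names are illustrative.

```python
def matvec(M, z):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * zj for m, zj in zip(row, z)) for row in M]


def gram(B):
    """Return H = B'B for a matrix B given as a list of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


B = [[1.0, -1.0]]          # hypothetical B; H = B'B is singular but semidefinite
H = gram(B)

z = [2.0, 2.0]             # Bz = 0, so the quadratic form vanishes
Hz = matvec(H, z)
quad = sum(zi * hi for zi, hi in zip(z, Hz))

z2 = [1.0, 0.0]            # Bz != 0, so the quadratic form is positive
quad2 = sum(zi * hi for zi, hi in zip(z2, matvec(H, z2)))
print(H, Hz, quad, quad2)
```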


THEOREM 2: Given a simple α-perceptron with simple A-units, an R-unit with a continuous monotonic sign-preserving signal generating function, a stimulus world $W$ (in which each stimulus ultimately reoccurs) and any response function $R(W)$ for which a solution exists, then by means of the error-corrective reinforcement procedure, the given response function can always be approximated in finite time by an output vector $R(W) + \varepsilon$, where $\varepsilon$ is a vector of elements $(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)$, $|\varepsilon_i| < \varepsilon'$, where $\varepsilon'$ may be an arbitrarily small quantity greater than zero.


PROOF: The following proof was suggested by R. D. Joseph. From Theorem 3 of Chapter 5, we know that under the conditions of the theorem, a solution $v$ to the equation $Gv = u$ exists. Suppose the system is currently in the state $x$, represented by $Gy = x$. From the definition of the G-matrix, and the fact that every stimulus must activate at least one A-unit for a solution to exist, we have


f = Ji > (G2) 55, ag 
The difference between the solution vector $u$ and the present state $x$ is given by

$$G(v - y) = u - x$$

Let $v - y = z$ and $u - x = w$. Then $Gz = w$.

We wish to show that by applying an error correction method to one component at a time of the vector $z$, $w$ must ultimately go to a point within the $\varepsilon''$ cube about 0. (The method will apply a correction of the proper size until a response $r_i^* = R_i^*$ is obtained.) We know that $u_i = \sum_j g_{ij} v_j$. Therefore, for the difference, $w$, we have

$$w_i = \sum_j g_{ij} z_j, \qquad \frac{\partial w_i}{\partial z_i} = g_{ii} > 0$$

Since $G$ is non-negative definite, we know that $F = (z, Gz) \ge 0$, $\dfrac{\partial F}{\partial z_i} = 2w_i$, and from Lemma 1 we know that if $w \ne 0$, $F > 0$. Therefore, if $w_i > 0$ decreases as a result of decreasing $z_i$, $F$ decreases; also, if $w_i < 0$ increases by increasing $z_i$, $F$ decreases (see Proof of Theorem 4, Chapter 5). To prove the theorem, it is sufficient to show that this implies that $w$ must ultimately enter the $\varepsilon''$ cube about zero.

Let $w_i^0$ = initial value of $w_i$ at start of a correction step
$z_i^0$ = initial value of $z_i$ at start of a correction step
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Then for the correction, we have 


Bur: = ~ Lv; 
Pur, sas - 
23: Se 
7a) . =z = <4 
<< i: 
OF ‘ ; 
83; =2luv; = [ 7; OG (3; de )] 
yee 3 we 


‘ 9 


2* dz; = 2] [erst 93; (2-30) 3; 


a3; 
3: 3; 
2 © 954 
AF = ai [a;"+ Si (3:-3;°)] 
a 
,2 
Ae 
Gee 


2 wa 
Therefore, 4 < -“zs; << -€ 


Hence, there can be only a finite number of corrections, since $F \ge 0$,
and the vector $w = u - x$ must converge to a point within the $\epsilon'$ cube
about zero. But $u$ is the input to the R-unit. Since $r^*(u)$ is continuous,
there exists an $\epsilon'$ such that $|r^*(u + \delta) - r^*(u)| < \epsilon$ if
$|\delta| \le \epsilon'$. Therefore the response function converges together
with the vector $w$. Q.E.D.
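
The correction step used in this proof can be tried numerically. The sketch below is only an illustration under stated assumptions: the matrix `G`, the vector `v_true`, the tolerance `eps`, and the helper names `matvec`, `quad_form`, and `error_correct` are all hypothetical and do not come from the text. At each step the component with the largest error $w_i$ is corrected by $w_i / g_{ii}$ (so that $z_i$ changes by $-w_i / g_{ii}$), and $F = z'Gz$ is recorded so that its decrease can be inspected.

```python
def matvec(G, v):
    """Return the product G v for a matrix stored as a list of rows."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]


def quad_form(G, z):
    """Return F = z' G z."""
    return sum(zi * wi for zi, wi in zip(z, matvec(G, z)))


def error_correct(G, u, v_true, y, eps, max_steps=10000):
    """Correct one component at a time until every |w_i| <= eps.

    v_true is used only to record F = z'Gz with z = v_true - y;
    the corrections themselves use only the observed error w = u - Gy.
    """
    y = list(y)
    f_hist = [quad_form(G, [a - b for a, b in zip(v_true, y)])]
    for _ in range(max_steps):
        w = [ui - xi for ui, xi in zip(u, matvec(G, y))]
        i = max(range(len(w)), key=lambda k: abs(w[k]))
        if abs(w[i]) <= eps:
            return y, f_hist
        y[i] += w[i] / G[i][i]  # drives the error on component i to zero
        f_hist.append(quad_form(G, [a - b for a, b in zip(v_true, y)]))
    raise RuntimeError("no convergence within max_steps")


# Hypothetical 3-stimulus example; G is symmetric and non-negative definite.
G = [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]]
v_true = [1.0, -1.0, 0.5]
u = matvec(G, v_true)  # guarantees that a solution exists
eps = 1e-6
y_final, f_hist = error_correct(G, u, v_true, [0.0, 0.0, 0.0], eps)
```

In this run `f_hist` never increases, which is the numerical counterpart of $\Delta F = -(w_i^0)^2 / g_{ii}$.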


The following Lemma and Corollaries establish that the various
weaker forms of correction procedures are also capable of yielding a
solution to $R(W)$.


LEMMA 2: For the same conditions as Theorem 2, given that a
solution exists, the set of all solutions forms a hyperplane
of dimension equal to the nullity of $G$.


PROOF: Let $Gx = u$ be a solution. Of necessity $u_i = r_i$. Let
$Gy = u$ be another solution. Then $G(x - y) = 0$. Consequently $x - y$
is in the null space of $G$. Conversely, if $z - x$ is in the null space of $G$,
then $G(z - x) = 0$. Therefore, $Gz = u$, so that $z$ is a solution. Q.E.D.
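
A small numerical instance may make the lemma concrete. The matrix and vectors below are hypothetical values chosen only so that `G` is singular; the helper `matvec` is likewise an illustrative name. Adding any multiple of a null-space vector to one solution yields another solution.

```python
def matvec(G, v):
    """Return the product G v for a matrix stored as a list of rows."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]


# Hypothetical singular, symmetric G: the first two rows are identical.
G = [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
x = [1, 0, 3]            # one solution
u = matvec(G, x)         # the input vector it produces
null_vec = [1, -1, 0]    # lies in the null space of G

# Move along the null space by an arbitrary amount.
z = [xi + 2.5 * ni for xi, ni in zip(x, null_vec)]
still_solves = matvec(G, z) == u
```

Here the nullity of `G` is 1, so the solutions form a line through `x` in the direction `null_vec`.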


COROLLARY 1: For the conditions of Theorem 2, and a phase space which
is unbounded in all dimensions, the probability of convergence
to an arbitrarily close approximation to $R(W)$ by means of a
random-sign correction procedure or a random-perturbation
correction procedure may be less than 1.


PROOF: The random-sign and random-perturbation procedures were
defined in Section 5.6. $R(W)$ is taken to be any response function
obtainable by an R-unit with a monotonic signal generating function. For
convergence to occur, it would be necessary that a series of steps by
increments of fixed magnitude, $|\eta|$, but of random sign, should carry
the system from its initial state to an arbitrarily small distance, $\epsilon$,
from its required state. From Lemma 2, the solution states form a hyperplane
of dimension equal to the nullity of $G$, which has zero measure over
the phase space of the system. But a random walk of the type described
may carry the system arbitrarily far from its starting point, in a random
direction, and the probability that a vertex of this path will fall within a
distance $\epsilon$ of the solution hyperplane may be less than unity.


COROLLARY 2: Given the conditions of Theorem 2, and a phase space
bounded in all dimensions, then (given that a solution to
$R(W)$ exists in this bounded space) the response function
can always be approximated by means of the random-sign
correction procedure, the system converging in finite time
to an approximation $R(W) + \epsilon$, $\epsilon$ a vector, where
$|\epsilon_i| < \epsilon'$ for arbitrarily small $\epsilon' > 0$.


PROOF: Since the phase space is finite, the set of solution points within
the bounds defined above has positive measure. The random-sign correction
procedure cannot carry any of the A-unit outputs beyond the limit set for its
value; therefore, if the values approach their limit in any direction, a random
walk in the opposite direction will follow. This procedure will
ultimately take the representative point of the system into every set with
positive measure, provided $\eta$ is sufficiently small. Consequently, a
solution within the bounds stated by the theorem will be obtained in finite
time.
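
The role of the bound can be seen in a one-dimensional toy model. This is a deliberately simplified sketch, not the random-sign procedure of Section 5.6 itself: a single connection value takes steps of fixed size `eta` with random sign, is clipped to the interval `[-bound, bound]`, and stops when it comes within `eps` of a target value. The function name, the parameter values, and the fixed random seed are all assumptions made for illustration.

```python
import random


def bounded_random_sign_walk(target, eta, bound, eps, rng, max_steps=1_000_000):
    """Return the number of steps taken until |v - target| <= eps, or None."""
    v = 0.0
    for step in range(max_steps + 1):
        if abs(v - target) <= eps:
            return step
        v += eta if rng.random() < 0.5 else -eta
        v = max(-bound, min(bound, v))  # the value cannot leave the bounded space
    return None


rng = random.Random(0)  # fixed seed so that the run is repeatable
steps = bounded_random_sign_walk(target=0.5, eta=0.05, bound=1.0, eps=0.06, rng=rng)
```

Because `eta` is smaller than the width of the target interval, the walk cannot jump over the target, which mirrors the proviso that $\eta$ be sufficiently small.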


COROLLARY 3: Given the same conditions as Corollary 2, the
response function can always be approximated by
the random-perturbation correction procedure, the
system converging in finite time to an approximation
$R(W) + \epsilon$, $\epsilon$ having elements of magnitude $|\epsilon_i| < |\eta|$
if the reinforcement is quantized, or $|\epsilon_i| < \epsilon'$, $\epsilon' > 0$,
if $\eta$ is chosen from a continuous distribution around
zero.


PROOF: The proof follows the same line as that of Corollary 2. Since
each connection can be set to an independent value, in the quantized case
the total error over the set of all connections need not be greater than $\eta$,
while in the continuous case it may be made arbitrarily small.


Theorem 2 and its corollaries indicate that it is possible to
teach a simple perceptron to produce responses which are proportional to
some metric feature of the input stimuli, such as their size, or coordinates
of their center of gravity on the retina. In the latter case, the output of
such an R-unit can be fed back to the optical system to control the centering
of a stimulus in the field.


10.3 Perceptrons with Non-linear Transmission Functions 


In all perceptrons considered thus far, the transmission 


functions of connections from A-units to the R-unit have been of the form 

We will now consider functions of the more general form:

$$c^*_{ir} = f(a^*_i, v_{ir})$$

Where time is not specified, this is understood to mean

$$c^*_{ir}(t) = f(a^*_i(t-1), v_{ir}(t))$$

Since $a^*_i$ is a function of the input signal, $\alpha_i$, the transmission
function can be written in a still more general form (allowing for various types
of signal-generating functions in the A-units),

$$c^*_{ir}(t) = f(\alpha_i, v_{ir})$$

This form will be employed in the following theorems.


THEOREM 3: Given a simple perceptron with a simple R-unit, and with
transmission functions for all A-R connections of the form
$f(\alpha_i) v_{ir}$, where $f$ is any function, and given the
existence of a solution to a classification function $C(W)$
for this perceptron, then if $p(v)$ is any polynomial of
odd degree in $v$, there also exists a solution if the
transmission function is changed to $f(\alpha_i)\, p(v_{ir})$.


PROOF: A polynomial of odd degree can assume all possible values.
Therefore if $v_{ir}$ is the original value of the connection $c_{ir}$, there
exists a solution to $p(x) = v_{ir}$ yielding a new value, $x$, for the
connection $c_{ir}$ which will cause it to transmit an identical signal under
the new transmission function.
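
The existence argument can be made constructive. The following sketch assumes nothing beyond the statement that an odd-degree polynomial takes every real value: it brackets a sign change and then bisects. The function names, the sample polynomial $p(v) = v^3 - 2v + 1$, and the original connection value 0.75 are hypothetical choices for illustration.

```python
def poly(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def solve_odd_poly(coeffs, target, tol=1e-12):
    """Find x with p(x) = target for an odd-degree polynomial p, by bisection."""
    if len(coeffs) % 2 != 0 or coeffs[-1] == 0:
        raise ValueError("an odd-degree polynomial is required")

    def g(x):
        return poly(coeffs, x) - target

    lo, hi = -1.0, 1.0
    # Widen the bracket until g changes sign; odd degree guarantees this happens.
    while g(lo) * g(hi) > 0:
        lo, hi = 2 * lo, 2 * hi
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


coeffs = [1.0, -2.0, 0.0, 1.0]   # p(v) = 1 - 2v + v**3
v_old = 0.75                     # value under the original function f(alpha) * v
v_new = solve_odd_poly(coeffs, v_old)
```

With `v_new` in place of `v_old`, the connection transmits $f(\alpha)\,p(v_{new}) = f(\alpha)\,v_{old}$, the same signal as before.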


THEOREM 4: Given the perceptron of Theorem 3, if a solution exists
for some transmission function $f(\alpha_i) v_{ir}$, a solution
does not necessarily exist for the transmission function
$g(\alpha_i) v_{ir}$, $g \ne f$.


PROOF: Suppose the number of A-units is equal to the number of stimuli
in $W$. Let $B$ = matrix of elements $b_{ij}$ representing the value of the
function $f(\alpha_i(j))$ which is the coefficient of $v_{ir}$ for stimulus $S_j$.
Then for a solution to exist, there must be some vector $V$ and some
vector $U$ in the orthant required by $C(W)$, such that $BV = U$. But if $B$
is singular, there must be some $C(W)$ for which no solution exists. This
can be demonstrated by noting that each $C(W)$ requires a solution vector in
a different orthant, the set of all $C(W)$ requiring solutions in every possible
orthant. But if $B$ is singular, it maps the entire space into a hyperplane,
and this plane must fail to intersect certain orthants. Consequently, the
functions $C(W)$ which are represented by vectors in those orthants have no
solution. Now consider the following cases:


CASE 1: For the transmission function $\alpha v$, let the matrix

$$B = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 2 & 3 & 4 \end{pmatrix}$$

This is singular, and consequently there are some insoluble classifications.
Now change the transmission function to $\alpha^2 v$, yielding

$$B = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 4 & 9 \\ 4 & 9 & 16 \end{pmatrix}$$

This matrix is non-singular, so that with the non-linear transfer function,
all classifications are soluble.


CASE 2: In this case it is shown, conversely, that there may be situations
in which a linear transmission function will yield solutions which are
unobtainable with a particular non-linear function. Let the transmission
function be $\alpha v$, with the matrix

$$B = \begin{pmatrix} 3 & 5 & 8 \\ 4 & 12 & 15 \\ 5 & 13 & 17 \end{pmatrix}$$

This matrix is non-singular, so there is a solution for every $C(W)$. But now
let the transmission function be $\alpha^2 v$. Then

$$B = \begin{pmatrix} 9 & 25 & 64 \\ 16 & 144 & 225 \\ 25 & 169 & 289 \end{pmatrix}$$

which is singular, implying that there is some $C(W)$ with no solution.
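
The singularity claims in the two cases can be checked with exact integer arithmetic. The sketch below reads the first row of the Case 1 matrix as (1, 1, 1), which is the reading consistent with its squared form; the helper names `det3` and `squared` are illustrative.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def squared(m):
    """Square every element, as when alpha*v is replaced by alpha**2 * v."""
    return [[x * x for x in row] for row in m]


case1 = [[1, 1, 1], [1, 2, 3], [2, 3, 4]]
case2 = [[3, 5, 8], [4, 12, 15], [5, 13, 17]]

dets = {
    "case 1, linear": det3(case1),
    "case 1, squared": det3(squared(case1)),
    "case 2, linear": det3(case2),
    "case 2, squared": det3(squared(case2)),
}
```

In Case 2 the columns are Pythagorean triples (3, 4, 5), (5, 12, 13), and (8, 15, 17), so after squaring the third row equals the sum of the first two, which is why the squared matrix is singular.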


THEOREM 5: Given a simple perceptron with A-R connections which
differ in their transmission functions (or with uniform
transmission functions but non-simple A-units) a response
function $R(W)$ may have a solution which is unattainable by
either the error correction procedure or the random-sign
correction procedure.


PROOF: Consider a perceptron with a single sensory unit and two A-units.
Let the R-unit be a linear amplifier with gain of 1. Let the sensory unit
emit signals 0, 1, or 2 depending upon the intensity of the stimulus. The
required response function is $R(W) = (0, +1, -1)$ corresponding to a null
stimulus, a low-intensity stimulus, and a high-intensity stimulus, respectively.
Let the transmission function of $c_{1r}$ be $\alpha v$, and the transmission
function of $c_{2r}$ be $\alpha^2 v$. The response function $R(W)$ then has a
solution if we set $v_{1r} = 2.5$ and $v_{2r} = -1.5$. But this is the only
possible solution, and is unattainable by the error correction or random-sign
procedures, since both connections are always activated together and
consequently must always be equal in value under these procedures (assuming
that their initial values are equal). This example is sufficient to prove the
theorem for the case of non-uniform transmission functions.
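
The arithmetic of this example can be verified directly. The sketch assumes the two transmission functions are $\alpha v$ and $\alpha^2 v$, which is the reading consistent with the stated solution $(2.5, -1.5)$; the function name `response` is illustrative.

```python
from fractions import Fraction


def response(alpha, v1, v2):
    """Input to the linear R-unit: alpha * v1 + alpha**2 * v2 (assumed reading)."""
    return alpha * v1 + alpha ** 2 * v2


# Requirements: alpha = 1 gives +1 and alpha = 2 gives -1, that is
#   v1 + v2 = 1   and   2*v1 + 4*v2 = -1.
# Subtracting twice the first equation from the second leaves 2*v2 = -3.
v2 = Fraction(-3, 2)
v1 = 1 - v2

responses = [response(a, v1, v2) for a in (0, 1, 2)]

# With equal values v1 = v2 = v, the low-intensity stimulus forces v = 1/2,
# and the high-intensity stimulus then yields 6 * v = 3 instead of -1.
equal_value_response = response(2, Fraction(1, 2), Fraction(1, 2))
```

The two linear equations have a unique solution, and it requires unequal values, which a procedure that always changes both connections by the same amount cannot reach from equal initial values.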


For the second case, in which all transmission functions are
uniform, but the perceptron has non-simple A-units, consider the following
perceptron:

[Figure: perceptron with two A-units, $a_1$ and $a_2$]

The values of all S-A connections are +1, and the A-units are both linear,
with transmission function $\alpha v$. Let the environment consist of the two
stimuli $S_1 = s_1$, and $S_2 = (s_1, s_2)$. Then a solution exists to
the response function $R = (+1, -2)$, namely $v_{1r} = +3$, $v_{2r} = -2$.
However, the error-correction or random-sign correction procedures will
not work, since both A-units are always active (where "active" means that
they emit a non-zero signal). Note that a solution also exists to the
classification $(+1, -1)$ for this perceptron, and that this is also
unattainable by the methods indicated.


The sixth theorem was proposed by R. D. Joseph.


THEOREM 6: Given a simple perceptron with any mixture of transmission
functions $f_i(\alpha_i, v_{ir})$ for the connections $c_{ir}$, and
a response function $R(W)$ for which a solution exists; then
there exists some transmission function $g(\alpha, v)$ which
is uniform for all connections, such that a solution to $R(W)$
exists.


PROOF: Let $f_i(\alpha_i, v_{ir})$ = signal from unit $a_i$ when stimulus $S_j$
occurs. Then we can fit a polynomial

$$f_i(\alpha_i(S_j), v_{ir}) = \sum_{k=0}^{n-1} c_{ik}\, \alpha_i^k(S_j)$$

for each stimulus $S_j$. The coefficients, $c_{ik}$ (which depend on the
A-unit, $a_i$) can be replaced by polynomials

$$c_{ik} = p_k(i) = \sum_{l=0}^{N_a - 1} b_{kl}\, i^l$$

Thus we have, for all values of $i$,

$$f_i(\alpha_i, v_{ir}) = \sum_{k=0}^{n-1} \sum_{l=0}^{N_a - 1} b_{kl}\, i^l\, \alpha_i^k$$

which satisfies the conditions required by the theorem for $g(\alpha, v)$
if we set $v_{ir} = i$.


It should be noted that this theorem applies only to a given response
function for which a solution exists; if a different response function also has
a solution, then there will again be a uniform transmission function for all
A-units which will solve the problem, but this transmission function may
differ from the one obtained for the original response function.


We have seen in Theorem 5 that if the connections differ in
transmission functions, or the A-units differ in signal generating functions,
response functions may have solutions which cannot be obtained by the more
systematic correction procedures. The following theorem proves that in
this case the weakest of the correction procedures (the random-perturbation
method) can still be used successfully.


THEOREM 7: Given a simple perceptron with an R-unit which is either
simple or has a continuous signal generating function,
and with any combination of transmission functions from
its A-units (all continuous functions of $v_{ir}$, equal to
zero if $\alpha_i = 0$), and given a bounded phase space
within which a solution exists for $R(W)$; then, if each
stimulus in $W$ ultimately reoccurs, an approximate
solution $R(W) + \epsilon$ is always attainable in finite time
by the random-perturbation correction procedure.


PROOF: For an R-unit of the specified type, and a bounded phase space,
the solution set has positive measure, over the region defined by $R(W) + \epsilon$
(where $\epsilon$ consists of arbitrarily small elements, $|\epsilon_i| < \epsilon'$).
To achieve an approximate solution within this set, it is only necessary to
adjust the values of the active A-units for each stimulus. Since, under the
random-perturbation procedure, each active connection will independently tend to
assume a value in every admissible range with positive measure, the active
set of connections as a whole will ultimately attain a value configuration
within the solution set.


10.4 Optimum Transmission Functions 


The general conclusions of the preceding pages are that while a
completely linear perceptron does not work satisfactorily, there are many
possible transmission functions which seem to work quite well. For many
of these, there is no choice to be made from the standpoint of ability to
achieve a solution, for they all seem to be capable of solving the same
problems equally well. From the standpoint of efficiency of discrimination
and speed of learning, however, the various transmission functions might
differ considerably from one another. In this section, making use of an
analysis due to Joseph, it will be shown that with some fairly weak constraints
on the system under consideration, an optimum transmission function exists,
and that this takes the form of a quadratic function of $v_{ir}$ rather than a
linear function.

The constraints on the system to be analyzed are as follows:


1. The analysis deals with S-controlled discrimination
experiments, with a fixed training sequence.


2. The conditional distribution of $v_{ir}$ for connections activated
by a test stimulus of the positive class, $S_x$, is assumed to be independent
of the choice of $S_x$. Similarly, the distribution of $v_{ir}$ for active
connections is assumed to be independent of the exact choice of $S_x$ when the
test stimulus is selected from the negative class.


3. It is assumed that the conditional distribution of $v_{ir}$ for
the connections activated by $S_x$ is a normal distribution, and that either
the distributions are different or the probabilities $Q_x$ are different, for
test stimuli in the positive and negative classes. These constraints will
generally be met satisfactorily if the positive class consists of all possible
positions on the retina of a large stimulus, and the negative class consists
of all possible positions of a small stimulus. The main requirement is one
of equivalence of stimuli within each class, and dissimilarity between classes,
with respect to the distribution or number of signals transmitted from A-units
to the R-unit.


The discrimination problem can be stated as one of testing a
hypothesis about the test stimulus, $S_x$. The response unit is required
to test the hypothesis that $S_x$ is a member of the positive class against
the possibility that it is a member of the negative class. If the test stimulus
is a member of the positive class, the output of an A-unit (subject to the
above assumptions about the system being analyzed) will have the distribution

$$\begin{cases} 0 & \text{with probability } 1 - Q_x(+) \\ v & \text{with density function } \dfrac{Q_x(+)}{\sqrt{2\pi}\,\sigma_{(+)}} \exp\left\{ -\dfrac{(v - \mu_{(+)})^2}{2\sigma_{(+)}^2} \right\} \end{cases} \tag{10.3}$$

where $Q_x(+)$, $\sigma_{(+)}$, and $\mu_{(+)}$ are the parameters characterizing
stimuli of the positive class. Similarly, if the test stimulus is a member of
the negative class, the output of an A-unit will have the distribution

$$\begin{cases} 0 & \text{with probability } 1 - Q_x(-) \\ v & \text{with density function } \dfrac{Q_x(-)}{\sqrt{2\pi}\,\sigma_{(-)}} \exp\left\{ -\dfrac{(v - \mu_{(-)})^2}{2\sigma_{(-)}^2} \right\} \end{cases} \tag{10.4}$$

where $Q_x(-)$, $\sigma_{(-)}$, and $\mu_{(-)}$ are the parameters characterizing
stimuli of the negative class. Thus, the problem can be restated as one of
testing whether the output of an A-unit has the distribution (10.3) or the
distribution (10.4).


There is thus a simple hypothesis (dealing with a single distribution)
and a simple alternative. As Joseph has observed, under these conditions,
for any significance level, the likelihood ratio test is most powerful. In
performing this test, we would make $N$ independent observations of $v$
(corresponding to a sample of $N$ A-units with independent origin point
configurations), and obtain the likelihood ratio:

$$L = \frac{(1 - Q_x(+))^{N-n} \left[ \dfrac{Q_x(+)}{\sqrt{2\pi}\,\sigma_{(+)}} \right]^{n} \exp\left\{ -\sum_i \dfrac{(v_i - \mu_{(+)})^2}{2\sigma_{(+)}^2} \right\}}{(1 - Q_x(-))^{N-n} \left[ \dfrac{Q_x(-)}{\sqrt{2\pi}\,\sigma_{(-)}} \right]^{n} \exp\left\{ -\sum_i \dfrac{(v_i - \mu_{(-)})^2}{2\sigma_{(-)}^2} \right\}}$$

where $n$ is the number of active A-units, and the summation on $i$ is over
active units only. If $L$ is greater than a preassigned constant $L_0$, we
accept the hypothesis that $S_x$ is a member of the positive class; if $L$
is less than $L_0$, we accept the alternative, that $S_x$ is a member of
the negative class. The constant $L_0$, corresponding to the threshold
of the R-unit in a perceptron employing this procedure, determines the
power and significance of the test. (The "significance" is measured by
the probability of erroneously rejecting a positive stimulus, and the "power"
is the probability of correctly classifying a negative stimulus.) In logarithmic
form, the condition $L \ge L_0$ becomes

$$\sum_i \left[ \frac{(v_i - \mu_{(-)})^2}{2\sigma_{(-)}^2} - \frac{(v_i - \mu_{(+)})^2}{2\sigma_{(+)}^2} + \ln \frac{\sigma_{(-)}\, Q_x(+)\,(1 - Q_x(-))}{\sigma_{(+)}\, Q_x(-)\,(1 - Q_x(+))} \right] \ge \ln L_0 + N \ln \frac{1 - Q_x(-)}{1 - Q_x(+)}$$

Thus, the required test is effectively performed if the perceptron is designed
with R-units having a threshold $\ln L_0 + N \ln \dfrac{1 - Q_x(-)}{1 - Q_x(+)}$
and the transmission functions from A to R-units are of the form

$$F(\alpha, v) = \begin{cases} 0 & \text{if } \alpha < \theta \\[4pt] \left( \dfrac{1}{2\sigma_{(-)}^2} - \dfrac{1}{2\sigma_{(+)}^2} \right) v^2 - \left( \dfrac{\mu_{(-)}}{\sigma_{(-)}^2} - \dfrac{\mu_{(+)}}{\sigma_{(+)}^2} \right) v + \dfrac{\mu_{(-)}^2}{2\sigma_{(-)}^2} - \dfrac{\mu_{(+)}^2}{2\sigma_{(+)}^2} + \ln \dfrac{\sigma_{(-)}\, Q_x(+)\,(1 - Q_x(-))}{\sigma_{(+)}\, Q_x(-)\,(1 - Q_x(+))} & \text{if } \alpha \ge \theta \end{cases}$$
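
The quadratic form can be checked against the log likelihood ratio from which it is derived. The sketch below is an illustration under stated assumptions: the parameter values are invented, and the function names `log_density_term` and `transmission` are hypothetical. It compares the expanded quadratic in $v$ with the difference of the two log densities of (10.3) and (10.4), plus the term $\ln[(1 - Q_x(-)) / (1 - Q_x(+))]$ that is carried into each active unit's contribution.

```python
import math


def log_density_term(v, q, mu, sigma):
    """Log of Q / (sqrt(2 pi) sigma) * exp(-(v - mu)**2 / (2 sigma**2))."""
    return (math.log(q) - math.log(math.sqrt(2 * math.pi) * sigma)
            - (v - mu) ** 2 / (2 * sigma ** 2))


def transmission(v, q_pos, mu_pos, sigma_pos, q_neg, mu_neg, sigma_neg):
    """Quadratic contribution of one active A-unit to the test statistic."""
    a = 1 / (2 * sigma_neg ** 2) - 1 / (2 * sigma_pos ** 2)
    b = mu_pos / sigma_pos ** 2 - mu_neg / sigma_neg ** 2
    c = (mu_neg ** 2 / (2 * sigma_neg ** 2) - mu_pos ** 2 / (2 * sigma_pos ** 2)
         + math.log(sigma_neg * q_pos * (1 - q_neg)
                    / (sigma_pos * q_neg * (1 - q_pos))))
    return a * v * v + b * v + c


# Hypothetical class parameters (not taken from the text).
pos = dict(q=0.3, mu=1.0, sigma=0.5)
neg = dict(q=0.1, mu=-0.5, sigma=1.0)

max_gap = 0.0
for v in (-2.0, -0.5, 0.0, 0.7, 1.0, 3.0):
    direct = (log_density_term(v, **pos) - log_density_term(v, **neg)
              + math.log((1 - neg["q"]) / (1 - pos["q"])))
    quadratic = transmission(v, pos["q"], pos["mu"], pos["sigma"],
                             neg["q"], neg["mu"], neg["sigma"])
    max_gap = max(max_gap, abs(direct - quadratic))
```

If the two classes had equal variances, the coefficient `a` would vanish and the function would reduce to a linear one in $v$.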


The actual savings that might be obtained by the use of such a
quadratic form have not been investigated numerically. In practice, they
are probably slight. A further discussion of the optimization problem,
including the optimization of the upper and lower bounds in a bounded value
perceptron, can be found in Joseph, Ref. 41.


Prof. A. Gamba, in a related paper, has observed that not only the
transmission functions but the reinforcement rule might be profitably modified
in order to optimize the overall decision function of the system (Ref. 23).


11. PERCEPTRONS WITH DISTRIBUTED TRANSMISSION TIMES


One of the requirements for a simple perceptron is that the
transmission time, $\tau_{ij}$, should be equal for all connections, $c_{ij}$.
In this chapter, we consider the consequences of allowing a distribution
of transmission times. It is obvious that under these conditions the set of
A-units active at time $t$ will depend not on the single momentary stimulus
occurring at time $t - \tau$, but rather on the entire sequence of stimuli
occurring between $t - \tau_{max}$ and $t - \tau_{min}$. We shall first consider
the cases of binomial and Poisson models where $\tau_{ij}$ is distributed with a
discrete spectrum, $\tau_{ij}$ always being an integer equal to or greater than 1.
We shall then consider the case of a continuous Gaussian distribution for
$\tau_{ij}$.


11.1 Binomial Models with Discrete Spectrum of $\tau_{ij}$


For the binomial case, we shall consider only the case where
each A-unit receives a fixed number of connections of each type (excitatory
and inhibitory) with $\tau_{ij} = 1$, and a fixed number with $\tau_{ij} = 2$.
Specifically, the parameters of an A-unit are:


$\theta$ = threshold (defined as usual)

$x_1$ = number of excitatory connections with $\tau_{ij} = 1$
$y_1$ = number of inhibitory connections with $\tau_{ij} = 1$
$x_2$ = number of excitatory connections with $\tau_{ij} = 2$
$y_2$ = number of inhibitory connections with $\tau_{ij} = 2$

Models with a greater number of possible values for $\tau_{ij}$ can be analyzed
by extensions of the method applied here. The object of the analysis is to find
$Q_i$ and $Q_{ij}$ at time $t$, as functions of the two-step sequences of stimuli:

$$\mathcal{S}_i = S_i'(t-2),\ S_i(t-1)$$
$$\mathcal{S}_j = S_j'(t-2),\ S_j(t-1)$$

The notation $S_i'$ will be used consistently to denote the stimulus preceding
the terminal stimulus in sequence $\mathcal{S}_i$. Similarly, in sequences of more
than two stimuli, $S_i''$ will be used to denote the third stimulus from the
end, etc. In the present model, sequences of length greater than 2 need not
be considered. If it is assumed that A to R-unit connections all have equal
transmission times, the analysis of performance in terms of the Q-functions
will be identical with the analysis for simple perceptrons, the important
difference being that the perceptron is now learning to recognize sequences
of stimuli, rather than isolated momentary events.


The total input signal to an A-unit at time $t$, $\alpha(t)$, is now
a sum of four components, namely,

$$\alpha(t) = E_1 + E_2 - I_1 - I_2$$

where $E_1$ = number of excitatory connections with $\tau = 1$, having origins
active at $t - 1$

$I_1$ = number of inhibitory connections with $\tau = 1$, having origins
active at $t - 1$

$E_2$ = number of excitatory connections with $\tau = 2$, having origins
active at $t - 2$

$I_2$ = number of inhibitory connections with $\tau = 2$, having origins
active at $t - 2$.


As usual, $a^*_i(t) = 1$ if $\alpha_i(t) \ge \theta$, and 0 otherwise. $Q_i$ is then
given by the following equation, which is analogous to (6.1):

$$Q_i = \sum_{E_1 + E_2 - I_1 - I_2 \ge \theta} P_{x_1}(E_1)\, P_{x_2}(E_2)\, P_{y_1}(I_1)\, P_{y_2}(I_2) \tag{11.1}$$

where the probabilities $P_{x_1}$, $P_{x_2}$, $P_{y_1}$, and $P_{y_2}$ are defined as
in (6.2), with the substitution of the appropriate parameters, and the stimulus
measures $R_i$ in the expressions for $P_{x_1}$ and $P_{y_1}$, and $R_i'$ in the
expressions for $P_{x_2}$ and $P_{y_2}$.
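
The sum in (11.1) is small enough to evaluate by direct enumeration. The sketch below makes an assumption that should be checked against (6.2), which is not reproduced in this section: each probability is taken to be binomial, with every connection's origin point active independently with probability equal to the stimulus measure. The function names and the numerical values are hypothetical.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability of k successes in n independent trials (assumed form of 6.2)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def q_i(theta, x1, y1, x2, y2, r_now, r_prev):
    """Probability that E1 + E2 - I1 - I2 >= theta.

    r_now  : measure of the terminal stimulus (seen through tau = 1 connections)
    r_prev : measure of the preceding stimulus (seen through tau = 2 connections)
    """
    total = 0.0
    for e1 in range(x1 + 1):
        for e2 in range(x2 + 1):
            for i1 in range(y1 + 1):
                for i2 in range(y2 + 1):
                    if e1 + e2 - i1 - i2 >= theta:
                        total += (binom_pmf(x1, e1, r_now) * binom_pmf(x2, e2, r_prev)
                                  * binom_pmf(y1, i1, r_now) * binom_pmf(y2, i2, r_prev))
    return total


# Hypothetical A-unit: three connections of each kind, threshold 2.
q_equal = q_i(2, 3, 3, 3, 3, 0.2, 0.2)      # both stimuli have the same measure
q_single = q_i(2, 6, 6, 0, 0, 0.2, 0.2)     # all six connections share one delay
q_unequal = q_i(2, 3, 3, 3, 3, 0.2, 0.4)    # the preceding stimulus is larger
```

When the two stimuli have equal measure, `q_equal` agrees with the single-delay value `q_single`, as expected for a sum of independent binomial counts with a common probability.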
In a similar manner, the expression for $Q_{ij}$ can be obtained by
the extension of the treatment employed in Equations 6.5 and 6.6. However,
there are now eight components to be considered for $\alpha$ for each stimulus
sequence. Specifically,

$$\alpha(i) = E_i + E_c + E_i' + E_c' - I_i - I_c - I_i' - I_c'$$

$$\alpha(j) = E_j + E_c + E_j' + E_c' - I_j - I_c - I_j' - I_c'$$

where $E_i$ and $I_i$ are defined, as before, as the excitatory and inhibitory
components originating from the set of retinal points situated in $S_i$ and
not in $S_j$, $E_i'$ and $I_i'$ are the corresponding components originating
from the set of retinal points situated in $S_i'$ but not in $S_j'$, and $E_j$,
$I_j$, $E_j'$, and $I_j'$ are similarly defined. Likewise, $E_c$ and $I_c$ are the
excitatory and inhibitory components coming from the retinal set common
to $S_i$ and $S_j$, and $E_c'$ and $I_c'$ are the components from the set
common to $S_i'$ and $S_j'$. Thus we have the equation

$$Q_{ij} = \sum_{\substack{\alpha(i) \ge \theta \\ \alpha(j) \ge \theta}} P_{x_1}(E_i, E_j, E_c)\, P_{y_1}(I_i, I_j, I_c)\, P_{x_2}(E_i', E_j', E_c')\, P_{y_2}(I_i', I_j', I_c') \tag{11.2}$$

the required multinomial probabilities being computed from equations (6.6)
with an obvious extension of the above notation to the quantities $A_i$, $A_j$,
$C$, $A_i'$, $A_j'$, and $C'$.

Since the Poisson model is much easier to compute, and has 
properties which are similar in all essentials to the binomial model, no 
numerical examples are given for the binomial model, but examples for the 


Poisson model can be found in the following section. 


11.2 Poisson Models with Discrete Spectrum of $\tau_{ij}$


The Poisson model to be considered again has two values of $\tau$,
namely $\tau = 1$ and $\tau = 2$, the parameters $x_1$, $x_2$, $y_1$, and $y_2$
being defined analogously to $x$ and $y$ in the Poisson model considered in
Chapter 6. The equations for $Q_i$ and $Q_{ij}$ can, of course, be developed
by extension of the equations of Chapter 6, as has just been done for the
binomial model. A considerably simpler approach is possible in the Poisson
model, however, if the corresponding stimulus areas at times $t - 1$ and $t - 2$
are also equal, i.e., $A_i = A_i'$, $A_j = A_j'$, and $C = C'$. In this
case, the previous equations (6.1, 6.3, 6.5, and 6.7) hold without modification,
except that $x = x_1 + x_2$ and $y = y_1 + y_2$. More generally, the previous
equations can always be employed by making the appropriate substitutions:

$$\begin{aligned} x R_i &= x_1 R_i + x_2 R_i' \\ x R_j &= x_1 R_j + x_2 R_j' \\ x A_i &= x_1 A_i + x_2 A_i' \\ x A_j &= x_1 A_j + x_2 A_j' \\ x C &= x_1 C + x_2 C' \end{aligned}$$

and similarly for the inhibitory components. If $x_1 = x_2$ and $y_1 = y_2$,
the equations for $Q_i$ and $Q_{ij}$ again become identical with the equations
of Chapter 6 with the substitutions $R_i \to \tfrac{1}{2}(R_i + R_i')$,
$A_i \to \tfrac{1}{2}(A_i + A_i')$, etc.
By an obvious extension to a spectrum with three or more values of $\tau$,
where $x_1 = x_2 = \ldots = x_n$ and $y_1 = y_2 = \ldots = y_n$, we can apply
the same equations, substituting the parameters

$$\begin{aligned} R_i &\to \tfrac{1}{n}\,(R_i + R_i' + R_i'' + \ldots) \\ A_i &\to \tfrac{1}{n}\,(A_i + A_i' + A_i'' + \ldots) \\ C &\to \tfrac{1}{n}\,(C + C' + C'' + \ldots) \end{aligned}$$

and similarly for $R_j$ and $A_j$.


As an example of the performance of such a system, consider
a Poisson model perceptron with an expected value of 6 excitatory and 6
inhibitory connections to each A-unit and $\theta = 2$. Let the environment
consist of a set of 4 by 20 vertical bars, such as were employed in the
experiments of the preceding chapters. The object will be to discriminate
a bar arriving at a certain fixed location by movement from the left from a
bar which arrives at the same location by movement from the right. Clearly,
if a single value of $\tau_{ij}$ is permitted, this task is impossible. Consider
first the case in which half of the excitatory and half of the inhibitory
connections have $\tau = 1$ and the remaining half have $\tau = 2$, so that
$x_1 = x_2 = y_1 = y_2 = 3$.
Let sequence $\mathcal{S}_i$ denote $(S_a(t-3), S_b(t-2), S_c(t-1))$ and
$\mathcal{S}_j$ denote $(S_e(t-3), S_d(t-2), S_c(t-1))$, where $S_a, \ldots, S_e$
represent successive adjacent positions of the vertical bar on the retina. Then
$Q_i = Q_j = .153$, and $Q_{ij} = .094$. Next, suppose one third of the excitatory
connections and one third of the inhibitory connections have delays $\tau = 3$,
one third have $\tau = 2$, and one third have $\tau = 1$, so that
$x_1 = x_2 = x_3 = y_1 = y_2 = y_3 = 2$.
In this case, $Q_i = .153$, as before, but $Q_{ij}$ is reduced to .063. Further
increasing the spread of the $\tau$ distribution will have the effect of further
reducing $Q_{ij}$ (for correspondingly lengthened stimulus sequences) while
keeping $Q_i$ constant. Thus, the greater the spread of the $\tau$ distribution,
the more readily can such "divergent" time sequences be distinguished.
Conversely, two sequences which are identical save for a momentary
divergence in recent time (say at $t - 1$) can be distinguished most readily
by a perceptron with $\tau_{ij}$ concentrated at small values, and increasing
the spread of the $\tau$ distribution will only increase the difficulty of
discrimination.


It should be emphasized that the set of active A-units depends
on the order and not merely on the constituents of a stimulus sequence. Thus
the sequence $(S_1, S_2, S_3)$ will generally activate a different set of A-units
from the sequence $(S_2, S_1, S_3)$ in which the first two members have been
inverted. In principle, a perceptron of this type which receives sequences of
sound spectra from a set of audio-filters (instead of visual patterns) should be
capable of distinguishing spoken words, or other characteristic sound sequences,
such as progressions of chords or melodic fragments.


11.3 Models with Normal Distribution of $\tau_{ij}$


A somewhat more "natural" model than the discrete spectrum
models considered above is one where the transmission time of each connection
is an independent random variable drawn from a normal distribution, with
parameters $\mu(\tau)$ and $\sigma(\tau)$. If an A-unit is to have a non-zero
probability of being active at time $t$ in such a model, the dynamics must be
modified by the introduction of an "integration period", $\Delta t$, such that

$$\alpha(t) = \sum_{\tau = t - \Delta t}^{t} \left[ E(\tau) - I(\tau) \right] \tag{11.3}$$

summing over all values of $\tau$ for which $E$ or $I$ (the numbers of excitatory
or inhibitory impulses arriving at the A-unit) are non-zero.
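
The windowed sum in (11.3) can be sketched as a count of impulses whose arrival times fall within the integration period. The arrival times, window lengths, and the function name `alpha` below are illustrative assumptions; in the model itself the arrival times would be the stimulus time plus normally distributed delays.

```python
def alpha(t, dt, excitatory_arrivals, inhibitory_arrivals):
    """Net input at time t from impulses arriving in the window [t - dt, t]."""
    e = sum(1 for a in excitatory_arrivals if t - dt <= a <= t)
    i = sum(1 for a in inhibitory_arrivals if t - dt <= a <= t)
    return e - i


# Hypothetical arrival times of impulses from one transient stimulus at time 0.
excitatory = [2.0, 2.5, 3.0, 3.5, 4.0]
inhibitory = []

short_window = alpha(3.0, 0.5, excitatory, inhibitory)   # sees part of the spread
long_window = alpha(4.0, 5.0, excitatory, inhibitory)    # sees every impulse
```

A window much shorter than the spread of delays sees only the impulses that happen to arrive together, while a long window collects all of them at once.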


The qualitative properties of such a system are clear without
further analysis. If $\Delta t$ is short compared to $\sigma(\tau)$, the presentation of
a momentary or transient stimulus will lead to a gradual increase in the
proportion of responding A-units (or the value of $Q_i$) followed by a gradual
decrease. If $\Delta t$ is greater than $\sigma(\tau)$, the system will respond with a
momentary burst of activity, maintained for a period equal to $\Delta t$, and
will then immediately relapse to inactivity. We are chiefly concerned with
the case where $\Delta t$ is less than $\sigma(\tau)$. In this case, the performance of
the system in discriminating sequences will be close to that of the Poisson
or binomial models, with an appropriate discrete spectrum of $\tau_{ij}$ to
approximate the normal distribution. There will be a maximum sensitivity
to differences between the two sequences $\mathcal{S}_i$ and $\mathcal{S}_j$ occurring at
time $t - \mu(\tau)$, with less sensitivity to more recent or more remote
differences between the sequences.


12. PERCEPTRONS WITH MULTIPLE R-UNITS


Up to now, the simple "three-layer" topology (S-A-R) with a
single R-unit has been the only one considered. In this chapter, we will
still consider only three-layer perceptrons, but more than one R-unit will
be permitted. The performance of such systems, it will be seen, does not
differ significantly from that of perceptrons which have been considered in
previous chapters, except for the fact that it is now possible to form
classifications with more than two classes, with simple R-units, or to have
perceptrons respond simultaneously to several different attributes of a
stimulus pattern. The most interesting analytic problems for such systems
are concerned with the optimum coding of the classes of patterns to be
recognized, in order to optimize performance.

12.1 Performance Analysis for Multiple R-unit Perceptrons 


Several types of topological organization which are possible for 
networks with more than one R-unit are illustrated in Figure 35. The set of 
A-units which are connected to a given R-unit will be called the source-set 
of that R-unit. The organization which is most economical in the number of 
A-units employed is that shown in Fig. 35(a), where every A-unit is connected 
to every R-unit. This is logically equivalent to the disjoint source-set model 
shown in Fig. 35(b), if every source set is required to have the same compo- 
sition of origin point configurations for its A-units. Unless otherwise specified, 
it will be assumed that each R-unit receives the same number of input 
connections; however, if the R-set is large, and the terminus of each connection
from an A-unit is selected at random, the total number of inputs to each R-unit
(i.e., the size of its source set) will be a binomially distributed random
variable. An inversion of this connection procedure is shown in Fig. 35(c).
In this case, each R-unit receives exactly $N$ connections, but the origins
are assigned at random among the A-units. Here the number of output
connections from an A-unit will be a Poisson distributed random variable.


[Figure 35. Types of topological organization for perceptrons with multiple
R-units: (a) every A-unit connected to the same fixed number of R-units (equal
to $N_R$ in the fully coupled case); (b) disjoint source-set for each R-unit
(the special case of (a) with one R-unit per A-unit); (c) each R-unit has a
source set of $N$ randomly selected A-units (equivalent to (a) if $N = N_a$).]


It can be readily seen that as $N_a$ becomes large, the various
topological connection schemes illustrated in Fig. 35 all become logically
equivalent in their performance characteristics, since it does not matter to
the performance of the perceptron whether two R-units are connected to the
identical A-unit or to two different A-units with equivalent origin point
configurations. For the sake of specificity, the following discussion will
assume the organization illustrated in Fig. 35(b), with a disjoint source-set
for each R-unit.


In S-controlled discrimination experiments, it is obvious that the
performance of such a system is equivalent to that of N_R simple perceptrons
(where N_R is the number of R-units), each of which is exposed to the same
training sequence, but trained on its own independent dichotomy of the
environment. For example, if N_R = 2, one R-unit might be trained to discriminate
between stimuli in the upper and lower halves of the field, while the second
R-unit is taught to discriminate between right and left halves. The
probability that both responses are correct, at the end of the training sequence,
will be the product of the probability that R_1 is correct on its dichotomy
and the probability that R_2 is correct on its dichotomy. In the present case,
assuming that stimuli occur with equal frequency in all parts of the field, we
would expect the two dichotomies to be equally difficult, so that the
probability of correct performance on the joint response would be the square of the
probability of correct response for either dichotomy considered separately.


In an error correction procedure, a more interesting problem
arises. Clearly, if each R-unit and its set of input connections are corrected
on an assigned binary classification or response function independently of the
other R-units, the same situation exists as in S-controlled experiments, and
the probability of correct response on the entire set of N_R R-units after a
given training sequence will be the product of the probabilities for each of the
response functions considered separately. More generally, if we let

P_x(R_i(W), N_i) = probability of correct response on test stimulus S_x
for the i-th response function, given a source-set with N_i members
connected to the R-unit,

we have

    P_x(R_1, \ldots, R_{N_R}) = \prod_{i=1}^{N_R} P_x(R_i(W), N_i)        (12.1)

for the probability that the joint response to S_x is correct on all R-units.
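As a minimal numerical sketch of equation (12.1), the joint probability can be computed as a plain product, assuming (as the text does) that the R-units are trained independently. The function name and the probability 0.9 are illustrative assumptions, not values taken from the experiments.

```python
from math import prod

def joint_correct_probability(per_unit_probs):
    """Equation (12.1): product over R-units of the probability that each
    unit is correct on its own dichotomy (units assumed independent)."""
    return prod(per_unit_probs)

# Illustrative numbers only: two equally difficult dichotomies, so the
# joint probability is the square of the single-unit probability.
p_single = 0.9
p_joint = joint_correct_probability([p_single, p_single])
```

With more R-units the product shrinks quickly, which is one reason the choice of dichotomies assigned to the R-units matters.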


Suppose, however, the reinforcement control system is only 
capable of recognizing that the total response (on all R-units jointly) is right 
or wrong, and cannot tell which individual R-units are contributing to the 
error. In this case, it might be supposed that the system would eventually 
learn the correct joint response by assuming that every R-unit is wrong 
whenever an error in the composite response occurs, and correcting the 
perceptron accordingly. This supposition, unfortunately, is not true,
as proven by the following theorem.


THEOREM: Given a perceptron with more than one R-unit, and a
response function R(W) or a classification C(W) for which
a solution exists, it may be impossible to achieve
this solution by an error correction procedure which
applies negative reinforcement jointly to all R-units
based on errors in the joint response.


PROOF: The theorem can be proven by a simple example. Consider the
perceptron illustrated below, which has two sensory units, two A-units,
and two R-units. (The topology corresponds, in this case, to Fig. 35(a)).

[Diagram: sensory point s_1 is connected to a_1, and s_2 to a_2. A-unit a_1
is connected to R_1 and R_2 by connections c_11 and c_12; a_2 is connected
to R_1 and R_2 by connections c_21 and c_22.]

Assume all v_ij initially = +1. Let W consist of two stimuli: S_1
illuminates sensory point s_1 alone, and S_2 illuminates s_2 alone. Let
the required joint classification function be:

(r_1*, r_2*) = (+1, -1) for S_1

(r_1*, r_2*) = (-1, +1) for S_2

A solution clearly exists, e.g., by making v_11 and v_22 positive, and v_12
and v_21 negative. Since all v_ij are initially positive, whichever
stimulus occurs first (say S_1) will elicit a positive output from both R-units,
which is wrong. The error correction procedure would then apply negative
reinforcement to both R-units, having the effect (if S_1 is the stimulus) of
making both connections from a_1 negative. But this now makes both
R-units negative, which is still wrong. Clearly, the error cannot be
corrected by reinforcement in the presence of S_1, since the signals to
both R-units are coupled, and must rise or fall together. If the second
stimulus should occur, the situation is not improved, and the same
oscillatory behavior will continue, with the perceptron switching from
(r_1*, r_2*) = (+1, +1) to (-1, -1) alternately. Thus a solution will
never be achieved, which proves the theorem.
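The oscillation described in the proof can be reproduced with a short simulation. This is only a sketch under stated assumptions: an R-unit is taken to respond +1 when its input is positive and -1 otherwise, the correction magnitude is 2 (so that a single correction reverses the sign of a value of +1 or -1), and the two stimuli are presented alternately. All function names are hypothetical.

```python
# v[i][r] is the value of the connection from A-unit a_(i+1) to R-unit R_(r+1).
REQUIRED = {0: (+1, -1), 1: (-1, +1)}   # S_1 -> (+1, -1), S_2 -> (-1, +1)

def respond(v, active):
    """Responses of both R-units when only A-unit `active` is on."""
    return tuple(1 if v[active][r] > 0 else -1 for r in range(2))

def joint_negative_reinforcement(v, active, required, eta=2):
    """If the joint response is wrong, push EVERY R-unit away from its
    current response, whether or not that unit was itself in error."""
    obtained = respond(v, active)
    if obtained != required:
        for r in range(2):
            v[active][r] -= eta * obtained[r]
    return obtained

def run(n_steps=12):
    v = [[1, 1], [1, 1]]                # all v_ij initially +1
    history = []
    for t in range(n_steps):
        stim = t % 2                    # alternate S_1, S_2 (assumed order)
        history.append(joint_negative_reinforcement(v, stim, REQUIRED[stim]))
    return history

history = run()
```

In every step of `history` the two R-units give the same response, so neither required pair, (+1, -1) or (-1, +1), is ever obtained.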


Note that if, instead of administering negative reinforcement
to all R-units (which assumes that each one is currently wrong) the error
correction procedure were to be modified to apply a correction to each
response unit according to the rule

    \eta_i = R_i^* - r_i^*        (12.2)

where η_i = value of η employed in reinforcement of the R_i connections,
and R_i* and r_i* are the required and obtained responses, respectively,
for the i-th R-unit, we then have the same conditions as in the case of
independent correction of each R-unit (see Definition 41, Chapter 5). Thus,
if we let η = R* - r* be a vector of N_R components, the i-th component
being given by (12.2), the system will always converge if a solution exists.
This implies, however, that the r.c.s. must not only be able to recognize
the existence of an error in some R-component, but must be able to
determine the magnitude (or at least the sign) of the error for each R-unit
independently, and control an appropriate value of η_i for each section
of the network. A logically similar procedure, which also yields a
solution, is to allow the r.c.s. to scan the R-units sequentially, checking
the correctness of each one in turn, and applying a correction only to the
R-unit currently being examined by applying negative reinforcement when
it is wrong. This requires a longer training process, but requires the r.c.s.
to act on only one component at a time, just as in a simple perceptron.
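For comparison, the following sketch applies rule (12.2) to the two-A-unit, two-R-unit example used in the proof of the theorem. It rests on the same assumptions as before (an R-unit responds +1 to a positive input and -1 otherwise, stimuli alternate), and the names are hypothetical.

```python
# v[i][r] is the value of the connection from A-unit a_(i+1) to R-unit R_(r+1).
REQUIRED = {0: (+1, -1), 1: (-1, +1)}   # S_1 -> (+1, -1), S_2 -> (-1, +1)

def respond(v, active):
    """Responses of both R-units when only A-unit `active` is on."""
    return tuple(1 if v[active][r] > 0 else -1 for r in range(2))

def per_unit_correction(v, active, required):
    """Rule (12.2): each R-unit receives its own eta = required - obtained."""
    obtained = respond(v, active)
    for r in range(2):
        eta_r = required[r] - obtained[r]      # 0, +2 or -2
        v[active][r] += eta_r
    return obtained

def train(max_steps=20):
    v = [[1, 1], [1, 1]]                       # all v_ij initially +1
    for t in range(max_steps):
        stim = t % 2
        per_unit_correction(v, stim, REQUIRED[stim])
        if all(respond(v, s) == REQUIRED[s] for s in (0, 1)):
            return t + 1, v
    return None, v

steps, values = train()
```

Here the correct unit of each pair is left alone, so the values settle with v_11 and v_22 positive and v_12 and v_21 negative, the solution named in the proof.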


12.2 Coding and Code-Optimization in Multiple Response Perceptrons


A perceptron with a large number of R-units can clearly be 
used to identify many more than two alternative kinds of stimuli. A number 
of possible schemes for the representation of information in such systems 
have been suggested. As a first possibility, each response may be used to 
identify an independent trait, or property of the stimulus, such as left/right 
location, size, horizontal or vertical elongation, etc. The combination of 
responses occurring when a test stimulus is presented should then serve as 
a description of the stimulus in terms of its traits. An alternative scheme 
is to assign a distinct response unit to each kind of stimulus, and train the 
perceptron to emit a +1 response only if that type of stimulus is present.
In this case, only one R-unit at a time would be active, the active unit 
identifying the stimulus class. Unlike the first scheme, where some response 
must be made for every binary trait whether applicable or not, the second 
scheme has the possibility of rejecting a stimulus altogether as "unknown", 
in which case all R-unit outputs would be negative. On the other hand, the 
second scheme lacks the economy of which the first is capable, and requires 
that every combination of traits which is to be distinguished must be assigned 
a special category and taught to the perceptron before it can be recognized.
In the "trait discrimination" approach, a new configuration may still be
correctly described, in terms of the characteristics present, even though it
has not been seen before. (This last feature is only weakly present in
the perceptrons considered thus far, since it depends strongly on
generalization. Some of the perceptrons to be considered in later chapters, which
generalize more effectively, can make optimum use of "descriptive codes".)


The above examples illustrate two types of response-codes, which
will be called configuration codes and position codes, respectively. A
configuration code employs the R-units independently of one another, assigning
an arbitrary dichotomy to each. This results in the assignment of a binary
number (if the R-units are two-state devices) to each stimulus. The total
number of stimulus types which can be encoded in this fashion, for a perceptron
with N_R R-units, is 2^{N_R}. A position code, on the other hand, permits
only one R-unit to be "on" (or in the positive state) for any one stimulus; the
code takes the form of a binary number of N_R bits, all but one of which are
zeros. The position of the non-zero bit indicates the class of the stimulus
identified. With this system, only N_R types of stimuli can be recognized.
The position code can be considered a special case of a configuration code in
which the positive classes of all dichotomies are disjoint, and the negative
classes are almost completely intersecting. A compromise between the two
approaches (which permits a descriptive statement to be obtained about a
stimulus without forcing a decision on inapplicable characteristics) would
assign n response units to each set of n mutually exclusive traits (for
example, 2 R-units would be assigned to left/right description, 3 to
horizontal, vertical, or diagonal specification, etc.). Each R-unit would then
be made to discriminate between "trait present" and "trait absent",
permitting any combination to occur. Such a system will be classed under
configuration codes.


The problem of finding an optimum code for a particular task can
be specified for a given value of N_R, an environment, W, and a
classification, C(W), into M types of stimuli. Clearly, if M is greater than N_R,
a configuration code must be used, or the problem is insoluble. If M is
commensurate with N_R, however, we have a choice of either assigning
a position code, in which each R-unit identifies the presence or absence of
a single type of stimulus, or assigning a configuration code, in which each
R-unit is assigned an arbitrary dichotomy. In general, the problem is to
find the optimum set of dichotomies to be assigned to the R-units, so as to
obtain the greatest probability of correct identification for an arbitrarily
selected test stimulus. Let us assume all stimuli equally likely to occur,
and all classes of equal size (i.e., an equal number of stimuli in each). The
number of A-units connected to each R-unit is also assumed to be constant.
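The capacities of the two kinds of code can be illustrated by enumerating their response vectors. The sketch below assumes two-state R-units with responses +1 and -1; the function names and the choice of 26 stimulus types with 5 R-units are illustrative only.

```python
from itertools import product

def configuration_code(n_r):
    """All response vectors available to a configuration code: every
    combination of +1/-1 over n_r two-state R-units."""
    return list(product((+1, -1), repeat=n_r))

def position_code(n_r):
    """Response vectors of a position code: exactly one R-unit positive."""
    return [tuple(+1 if i == k else -1 for i in range(n_r))
            for k in range(n_r)]

def code_can_represent(n_types, n_r, kind):
    """True if n_types stimulus types fit within the capacity of the code."""
    capacity = 2 ** n_r if kind == "configuration" else n_r
    return n_types <= capacity

n_config = len(configuration_code(5))     # 2**5 response vectors
n_position = len(position_code(5))        # 5 response vectors
```

With 5 R-units, for instance, 26 stimulus types fit a configuration code but not a position code. The all-negative vector is absent from `position_code`, which is what leaves it free to signal an "unknown" stimulus.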


Let the vector R* = (r_1*, r_2*, ..., r_{N_R}*) = the correct response
vector for a given test stimulus. Then, from equation (12.1) we are
required to maximize

    P_x(R^*) = \prod_{i=1}^{N_R} P_x(r_i^*)

Since we further assume that S_x is chosen arbitrarily, and that every
stimulus is equally likely to be chosen as a test stimulus, we require the
expected value

    E(P_x) = \frac{1}{n} \sum_{x} P_x(R^*) = \frac{1}{n} \sum_{x} \prod_{i=1}^{N_R} P_x(r_i^*)        (12.3)

to be maximal, where n denotes the number of stimuli in W. The choice of
dichotomies which maximizes (12.3) would be
considered an optimum code for the environment and perceptron in question.
At present, no general solution to this problem has been found. Several
heuristic cues as to the organization of optimal codes are worth noting,
however.


(1) If a given stimulus class has members which are
disjoint from the stimuli of all other classes, while the remaining classes
have large retinal intersections, it will clearly be advantageous to employ
a single R-unit for the recognition of the stimulus class in question, with a
highly asymmetric dichotomy which does not attempt to divide
the remaining stimuli into two sub-sets, but takes advantage of the
"natural" dichotomy formed on the basis of location.


(2) If the relationships of all stimulus classes are symmetric,
so that no two classes tend to "stick together" more than any other two
classes, and no pair of classes are easier to discriminate than any others,
and if S-controlled reinforcement is to be used, it will probably be best to
use equal dichotomies for all R-units (half of the stimuli in each positive
set) so as to avoid asymmetric generalizations from the larger set to the
smaller one. The results of the frequency bias experiments, illustrated in
Figs. 16 and 25, appear to support this conjecture. Where an error correction
method is used, however, empirical results suggest that asymmetric
dichotomies are preferable.


(3) There exist classifications which cannot be achieved by
means of a position code, which can be achieved with a configuration code.
For example, consider the following case: Let there be three stimuli in
W, such that S_1 activates a_1, S_2 activates a_2, and S_3 activates
both a_1 and a_2. Let there be three simple R-units, each connected to
both a_1 and a_2. It is required to assign a unique code number to each
of the three stimuli. With a position code, the R-unit assigned to identify
S_3 must give a positive response when both a_1 and a_2 are active, but a
negative response when either a_1 or a_2 alone is active. This is clearly
impossible, with simple R-units. However, if a configuration code is
employed, we can assign the R-function (r_1*, r_2*, r_3*) =

(+1, -1, -1) for S_1
(-1, +1, -1) for S_2
(+1, +1, -1) for S_3

which is readily soluble, by an error correction procedure. r_3* is
obviously redundant here, and is arbitrarily set to -1 for all stimuli.


(4) A general rule, proposed by Joseph, is the following:
The smallest possible number of R-units should be required to distinguish
between very similar stimuli. The more dissimilar two stimuli are, the
more R-units may be allowed to place the two in opposite classes.
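The example of item (3) can be checked with a small error-correction routine for a single simple R-unit. This is a sketch under an assumed convention (the R-unit responds +1 when its summed input is positive and -1 otherwise, with values starting at zero); the names are hypothetical. A failure to converge within the allotted passes is not itself a proof, but here the targets are contradictory: the unit for S_3 would need non-positive values on both connections and yet a positive sum.

```python
# Activity of (a_1, a_2) for stimuli S_1, S_2, S_3, as in item (3).
ACTIVITY = [(1, 0), (0, 1), (1, 1)]

def response(w, a):
    """Simple R-unit (assumed convention): +1 if the summed input is > 0."""
    return 1 if w[0] * a[0] + w[1] * a[1] > 0 else -1

def train_unit(targets, max_epochs=100):
    """Error correction for one R-unit; returns its values, or None if no
    error-free pass over the stimuli is reached within max_epochs."""
    w = [0, 0]
    for _ in range(max_epochs):
        errors = 0
        for a, target in zip(ACTIVITY, targets):
            err = target - response(w, a)      # 0, +2 or -2
            if err:
                errors += 1
                w = [w[0] + err * a[0], w[1] + err * a[1]]
        if errors == 0:
            return w
    return None

# Position code: the unit for S_3 must give -1, -1, +1 on S_1, S_2, S_3.
position_unit_for_s3 = train_unit((-1, -1, +1))

# Configuration code from the text: targets of r_1, r_2, r_3 over S_1..S_3.
configuration_units = [train_unit(t) for t in
                       ((+1, -1, +1), (-1, +1, +1), (-1, -1, -1))]
```

The position-code unit cycles without settling, while each unit of the configuration code is solved within two passes.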


Note that in the example given under (3), it is possible to assign an arbitrary
classification to an environment of 3 stimuli with only 2 A-units. This could not
be done with a simple perceptron (as proven in Corollary 2 of Theorem 3,
Chapter 5). The addition of a second R-unit in this model substitutes for
the missing A-unit which would otherwise be required.


In empirical tests with the Mark I perceptron (such as the
experiments described in the following section) it has been found that the
choice of a code, even with binary numbers of a fixed length, can easily
determine whether or not a particular task is within the perceptron's
capability.


12.3 Experiments with Multiple Response Systems 


The Mark I perceptron at C.A.L. is equipped with eight binary 
R-units, and 512 A-units, which can be employed in any combination. The 
network topology is of the type shown in Fig. 35(b). A number of experiments 
have been performed (Ref. 30) dealing with the recognition of letters of the 
alphabet and sets of geometrical patterns where multiple classifications are 


required. Two such experiments are illustrated in Figures 36 and 37. 


In Fig. 36, learning curves are shown for an S-controlled 
experiment on the left, and for an error-correction experiment on the right. 
In each case, the perceptron was taught to identify eight letters of the alpha- 
bet, presented in the form of large block letters in random locations, over a 
considerable part of the retinal field. In the error correction procedure, 


each of the erroneous R-units is corrected simultaneously. 


Figure 37 shows the learning curve for the entire alphabet, 
presented in fixed position. A partially optimized binary code employing 
five R-units was used here. This represents about the limit of the capacity 
of the Mark I system. Attempts at teaching the Mark I to recognize all
26 letters in two type faces simultaneously have been unsuccessful, the
maximum performance being about 85% on the combined alphabets. With a
discrimination task of this difficulty, any displacement of the patterns from
the position where they have been learned is likely to abolish the correct
response.


Figure 36 LEARNING CURVES FOR EIGHT LETTER IDENTIFICATION TASK
(left panel: S-CONTROLLED LEARNING; right panel: CORRECTIVE TRAINING)

Figure 37 LEARNING CURVE FOR 26 LETTERS: CORRECTIVE TRAINING


On easier problems, such as a four-letter discrimination task, 
the choice of code is found to make little difference in system performance. 
The code becomes critical only when the discrimination capability is marginal, 
as in the 26 letter identification task. Given the choice between a position 
code and a configuration code with the number of A-units in a source-set held 
constant, the position code generally seems preferable with the kinds of 
stimulus material employed in these experiments. If the same total number 
of A-units must be divided among the source sets of the additional R-units 
used for the position code, however, better performance is obtained with 
the more economical configuration code, which uses binary numbers for
identification, with larger source sets.


13. THREE-LAYER SYSTEMS WITH VARIABLE S-A CONNECTIONS


In the foregoing chapters, we have almost exhausted the 
possible ramifications of minimal three-layer perceptrons, having an 
S-A-R topology. Only one constraint remains to be dropped, in order
to obtain the most general system of this class: this is the requirement that 
S to A-unit connections must have fixed values, only the A to R connections 
being time-dependent. In this chapter, variable S-A connections will be 
introduced, and the application of an error-correction procedure to these 
connections will be analyzed. It would seem that considerable improvement 
in performance might be obtained if the values of the S to A connections 
could somehow be optimized by a learning process, rather than accepting 
the arbitrary or pre-designed network with which the perceptron starts out. 
It will be seen that this is indeed the case, provided certain pitfalls in the 


design of a reinforcement procedure are avoided. 


13.1 Assigned Error, and the Local Information Rule 


In order to apply an error correction procedure to all connections 
of a perceptron, including the S - A connections, we must first re-examine 
the concept of "error" which has been employed so far as a criterion for 
reinforcement. In the theorem of Section 12.1, it was shown that it will 
not do to assume that all units of the perceptron are equally in error when 
a mistake in the total response occurs. It was seen that if all connections
are corrected, on the assumption that both R-units are wrong (in the two
R-unit case employed for demonstration) a solution may never be achieved.
The alternative was to assign an error independently to each R-unit, by a
suitable criterion, and correct the connections leading to each R-unit in
accordance with the corresponding error indication. In the present case,
where A-units as well as R-units are to have their input-connections modified,
it becomes necessary to assign an error indication to each A-unit, as well
as to each R-unit.


In preceding chapters, the assigned error for an R-unit, E_r,
was taken to be equal to (R* - r*), where R* is the desired response, and
r* is the obtained response. A positive error meant that the R-unit was to
be turned to its positive state, and a negative error meant that it was to be
turned to its negative state, in the case of simple R-units. Similarly, for an
A-unit a_i, we might use a positive assigned error, E_i, to indicate that the
unit is to be turned "on", and a negative E_i to indicate that it is to be turned
"off", or made inactive, in response to the current stimulus. The difficulty is
that whereas R*, the desired response, is postulated at the outset, the desired
state of the A-unit is unknown. We can only say that we desire the A-unit to
assume some state in which its activity will aid, rather than hinder, the
perceptron in learning the assigned classification or response function.


One possible way of obtaining the required activity states of the
A-units would be to examine each possible state of the system, with its
corresponding G-matrix, and determine whether or not a solution to the
assigned problem exists. If a state is found in which a solution does exist,
then the appropriate responses can be taught to each A-unit, by means of a
standard error-correction procedure, operating on the A-units in the same
manner as on the R-units. Such an approach, however, evades the real
issue of finding a procedure which will guarantee convergence to a solution
without requiring that the reinforcement control system know the solution
state ahead of time. Specifically, in assigning an error-indication to an
A-unit, we wish to base the assignment only on the state of the network at
the time and locality where the error occurs. The following rule will
therefore be accepted as a working premise for all models to be considered:


LOCAL INFORMATION RULE: For any A-unit, a_i, the assignment of an
error E_i(t) can depend only on information concerning the
activity or signals received by a_i, the value of its output
connections, and the error assignment at their terminal points
at time t.


In other words, only a_i itself and the points to which it is directly
connected can determine the error assignment.


13.2 Necessity of Non-deterministic Correction Procedures 


By a "deterministic reinforcement procedure" we mean that if 
the same state of the system should occur repeatedly with all signals and 
values unchanged, an identical reinforcement will be applied; and that if 
two similar subnetworks are in the same state of activity, value, and error 
assignment, they will be modified identically. Up to this point, no problem 
has been found for which a solution exists, where a suitably defined 
deterministic reinforcement procedure could not find a solution. The first
exception to this is stated in the following theorem.


THEOREM 1: Given a three-layer series-coupled perceptron with
simple A and R-units, and variable-valued S-A connections,
and a classification C(W) for which a solution exists,
it may be impossible to achieve a solution by any
deterministic correction procedure which obeys the local
information rule.


PROOF: The proof is by example. Consider the following network:

[Diagram: two sensory points, s_1 and s_2, each connected to both
A-units, a_1 and a_2; each A-unit is in turn connected to both R-units,
R_1 and R_2.]

Let a_1 and a_2 have thresholds of 1, and let the stimuli of W consist of
s_1 alone (stimulus S_1) or s_2 alone (stimulus S_2). Let the required
classification be (R_1*, R_2*) = (+1, -1) for S_1 and (-1, +1) for S_2.
A solution clearly exists; for example, the following assignment of values
would be satisfactory:

[Diagram: the network above, with a satisfactory assignment of values
marked on its connections.]

In this problem, a solution clearly requires an asymmetric assignment of
values for "parallel" and "crossed" connections from each sensory unit and
from each A-unit. If we assume that all values are initially equal, then
either a_1 and a_2 are both on, or else both are off. In either case, one
of the R-units is wrong, and whichever one is wrong will induce a symmetric
correction of the values from both A-units. Moreover, since both a_1 and a_2
are in indistinguishable states (whichever R-unit happens to be wrong) under
the local information rule both units must receive an identical error indication.
But then the connections from whichever S-unit is active will both be modified
identically, and the result is that the members of each value-pair (from each
S-unit and from each A-unit) are still identical. The required asymmetry
between "parallel" and "crossed" connections can therefore never arise, and
the same response must always occur for S_1 and S_2. Q.E.D.
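The static part of this argument can be checked by enumeration. The sketch below assumes that an A-unit is on when its input reaches the threshold of 1 and that an R-unit responds +1 to a positive input and -1 otherwise. The asymmetric assignment shown is one assumed possibility, not necessarily the one in the original figure, and all names are hypothetical. The search over a small grid of values only illustrates the point; the general statement follows from the fact that equal "parallel" and "crossed" values always give both R-units the same input.

```python
from itertools import product

REQUIRED = {0: (+1, -1), 1: (-1, +1)}     # S_1 and S_2

def respond(stim, v_sa, v_ar, theta=1):
    """stim is the index of the single active sensory point.
    v_sa[s][i]: value from sensory point s to A-unit i.
    v_ar[i][k]: value from A-unit i to R-unit k."""
    a = [1 if v_sa[stim][i] >= theta else 0 for i in range(2)]
    return tuple(1 if a[0] * v_ar[0][k] + a[1] * v_ar[1][k] > 0 else -1
                 for k in range(2))

def solves(v_sa, v_ar):
    return all(respond(s, v_sa, v_ar) == REQUIRED[s] for s in (0, 1))

# One asymmetric assignment (an assumption): parallel +1, crossed -1.
asymmetric = ([[1, -1], [-1, 1]], [[1, -1], [-1, 1]])

def any_symmetric_solution(values=(-2, -1, 0, 1, 2)):
    """Search assignments whose parallel and crossed values are equal."""
    for p, q, x, y in product(values, repeat=4):
        v_sa = [[p, p], [q, q]]     # each S-unit: same value to a_1 and a_2
        v_ar = [[x, x], [y, y]]     # each A-unit: same value to R_1 and R_2
        if solves(v_sa, v_ar):
            return True
    return False

asymmetric_works = solves(*asymmetric)
symmetric_works = any_symmetric_solution()
```

The asymmetric assignment satisfies the required classification, while no assignment with equal value-pairs does, since R_1 and R_2 then always respond alike.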


While this theorem shows that a deterministic procedure cannot 
be guaranteed to work, it remains to be shown that a non-deterministic 
procedure will work. In the most extreme case, we could employ a procedure 
which randomly varies the value of every connection, independently of the others, 
as long as errors continue to occur. In this case, if the phase space of the 
system is bounded, a solution will certainly occur in finite time, but we have 
already seen the devastating consequences of a much less drastic randomization 
of the reinforcement process on learning time (cf. Figure 19). In the
following section, a more systematically directed procedure is presented,
which can be shown to lead to a solution with probability 1, as in the case
of error correction procedures considered for elementary perceptrons.


13.3 Back-Propagating Error Correction Procedures


The procedure to be described here is called the "back-
propagating error correction procedure" since it takes its cue from the
error of the R-units, propagating corrections back towards the sensory
end of the network if it fails to make a satisfactory correction quickly at
the response end. The actual correction procedure for the connections to
a given unit, regardless of whether it is an A-unit or an R-unit, is perfectly
identical to the correction procedure employed for an elementary perceptron,
based on the error-indication assigned to the terminal unit. Thus, if the
error E_i is positive, a correction is applied to the values of the active
connections terminating on a_i which would tend to increase the signal to a_i
algebraically, eventually turning it "on"; if E_i is negative, a correction,
η, of the opposite sign is applied to all active connections terminating on
a_i. The essential feature of the method is a probabilistic procedure for
assigning the errors, E_i.


The rules for the back-propagating correction procedure are
as follows:


1. For each R-unit, set E_r = R* - r*, where R* =
required response and r* = obtained response.


2. For each association unit, a_i, E_i is computed as
follows, for each stimulus: Begin with E_i = 0.


a) If a_i is active, and the connection c_ir terminates
on an R-unit with a non-zero error E_r which
differs in sign from v_ir, add -1 to E_i with
probability P_1.

b) If a_i is inactive, and the connection c_ir
terminates on an R-unit with an error E_r which
agrees in sign with v_ir, add +1 to E_i with
probability P_2.

c) If a_i is inactive, and the connection c_ir terminates
on an R-unit with an error E_r which does not agree
in sign with v_ir (or if v_ir is zero) add +1 to E_i
with probability P_3.

For all other conditions, E_i is not changed.


3. If E_i ≠ 0, add η to all active connections terminating
on the A or R-unit u_i, taking the sign of η to agree
with the sign of E_i. In symbols,

    \Delta v_{ji} = a_j^* \, \mathrm{sgn}(E_i) \, \epsilon

where ε is the magnitude of η.


In general, P_1 and P_2 are taken large relative to P_3. The effect of these
rules is to try to turn off any A-units (with probability P_1) whose output is
currently contributing to an error in an R-unit, and to try to turn on any
A-units (with probability P_2) which are currently off, but whose output
signals would help correct an error in one or more R-units if they
were on. The purpose of the third probability, P_3, is twofold; first,
if no A-units respond to a stimulus, and all of the values have the wrong
sign or are zero (as in typical initial conditions) it guarantees that some
A-units will come on; second, it prevents the permanent loss of A-units
which might be necessary for the proper response to some stimulus,
even though their values may have the wrong sign at some time during the
training procedure. If P_1 and P_2 are larger than P_3, the main changes
in the network will clearly all tend to go in the direction of a solution. The
following theorem proves that the procedure is sufficient to guarantee a
solution, if a solution exists, in the form of some assignment of values to the
network.
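The three rules above can be sketched in code as follows. This is an illustrative reading, not a definitive implementation, and it rests on several assumptions: A-units are on when their input reaches a threshold of 1; R-units respond +1 to a positive input and -1 otherwise; rules 2a to 2c are applied only for R-units whose error is non-zero; a zero value never triggers rule 2a; and the bound on A-R values required by the theorem that follows is not enforced. All function and parameter names are hypothetical, and no claim is made about how quickly `train` finds a solution for a given seed.

```python
import random

def sign(x):
    return (x > 0) - (x < 0)

def forward(stim, v_sa, v_ar, theta=1):
    """stim: list of 0/1 sensory signals. Returns (A activities, R responses)."""
    n_a, n_r = len(v_ar), len(v_ar[0])
    a = [1 if sum(s * v_sa[k][i] for k, s in enumerate(stim)) >= theta else 0
         for i in range(n_a)]
    r = [1 if sum(a[i] * v_ar[i][j] for i in range(n_a)) > 0 else -1
         for j in range(n_r)]
    return a, r

def assign_errors(a, r, required, v_ar, p1, p2, p3, rng):
    """Rule 1 (R-unit errors) and rule 2 (probabilistic A-unit errors)."""
    e_r = [required[j] - r[j] for j in range(len(r))]
    e_a = [0] * len(a)
    for i, active in enumerate(a):
        for j, err in enumerate(e_r):
            if err == 0:
                continue                       # assumption: only units in error count
            v = v_ar[i][j]
            agrees = sign(v) == sign(err)
            if active:
                if v != 0 and not agrees and rng.random() < p1:   # rule 2a
                    e_a[i] -= 1
            elif agrees:
                if rng.random() < p2:                             # rule 2b
                    e_a[i] += 1
            elif rng.random() < p3:                               # rule 2c
                e_a[i] += 1
    return e_r, e_a

def apply_corrections(stim, a, e_r, e_a, v_sa, v_ar, eps=1):
    """Rule 3: add sgn(E) * eps to every active connection ending on a unit."""
    for i, err in enumerate(e_a):              # S-A connections ending on a_i
        for k, s in enumerate(stim):
            v_sa[k][i] += s * sign(err) * eps
    for j, err in enumerate(e_r):              # A-R connections ending on R_j
        for i, act in enumerate(a):
            v_ar[i][j] += act * sign(err) * eps

def train(stimuli, required, v_sa, v_ar, p1=0.8, p2=0.8, p3=0.1,
          max_steps=10000, seed=0):
    """Cycle through the stimuli; return the step at which all responses
    are correct, or None if max_steps is exhausted."""
    rng = random.Random(seed)
    for step in range(max_steps):
        if all(forward(s, v_sa, v_ar)[1] == list(req)
               for s, req in zip(stimuli, required)):
            return step
        idx = step % len(stimuli)
        a, r = forward(stimuli[idx], v_sa, v_ar)
        e_r, e_a = assign_errors(a, r, required[idx], v_ar, p1, p2, p3, rng)
        apply_corrections(stimuli[idx], a, e_r, e_a, v_sa, v_ar)
    return None
```

A call such as `train([[1, 0], [0, 1]], [(+1, -1), (-1, +1)], v_sa, v_ar)` on a two-by-two-by-two network modifies `v_sa` and `v_ar` in place. Because the error assignment is drawn at random, two A-units in identical states need not receive identical corrections, which is the property a deterministic procedure lacks.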

THEOREM 2: Given a three-layer series-coupled perceptron, with
simple A and R-units, variable-valued S-A connections,
bounded A-R values, and a classification C(W) for which
a solution exists, then a solution to C(W) can be obtained
in finite time with probability 1 by means of a back-
propagating error-correction procedure, given that each
stimulus in W always reoccurs in finite time, and that
probabilities P_1, P_2, and P_3 are all greater than 0
and less than 1.

PROOF: The state of the S-A network can be characterized, for present
purposes, by an N_a by n matrix, A*, which consists of the N_a row vectors:

    A_i^* = (a_{i1}^*, a_{i2}^*, \ldots, a_{in}^*)

where a_ij* = 1 or 0 = signal generated by unit a_i in response to
stimulus S_j. Two assignments of values to S-A connections which yield
the same A*-matrix will be called equivalent S-A states. To each such
matrix, A*, there corresponds a G-matrix for the perceptron. We will say
that a given S-A state permits a solution if the corresponding G-matrix is
one for which a solution to C(W) exists.


First, suppose the system is initially in a state which permits
a solution. Then if it remains in this state sufficiently long, a solution must
occur with probability 1, due to Theorem 4 of Chapter 5. Since S-A
connections only change in value if the errors E_i are assigned magnitudes
other than zero, and since the probabilities P_1, P_2, and P_3 of
assigning non-zero E_i are all less than 1, there is a probability p > 0 that the
perceptron will remain in its initial state for any given finite time. Thus,
there is a probability greater than zero that a solution will be achieved
before any change in the A*-matrix occurs.


Next, suppose the Ap matrix changes to some different state 
before a solution is achieved, or suppose that the system starts out in a 
state which does not permit a solution. Then it is sufficient to show that 
the system will always return to a state which does permit a solution in 
finite time with probability 1, and that the probability P of obtaining a 
solution for a given S-A state does not approach zero with successive 
returns to the same state. If it does always return to such a state, then 
each time it arrives at such a state, there will be a probability greater 
than zero (and bounded away from zero) that it finds a solution before the 
state is destroyed. Thus, with sufficiently many returns to states which 


permit solutions, a solution will be found with probability 1. 


It is now necessary to show that from an arbitrary starting 
* 
state, the system will always achieve an A -matrix which permits a 


solution in finite time with probability 1. 
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* 
If the current A -matrix does not permit a solution, then 
either or both of the following conditions must be present: 
Le 


‘J 
possible is actually 0; 


(a) Some c which should be 1 for a solution to be 


(b) For some Li; which should be 0 and is actually 1, 
there must bea 7; the sign of which disagrees with 


R* for stimulus =F : 


The second condition follows from the fact that if every active connection 
from A to R-units has a $v_j$ with proper sign for every $S_i$, and if 
condition (a) is not present, then a solution already exists. Now suppose, 
for an arbitrary $A^*$-state, stimulus $S_i$ occurs. Then condition (a) may 
exist for some A-units, and condition (b) for others. For each A-unit 
which is currently off (including all of those to which condition (a) applies) 
Rule 2b or 2c of the correction procedure becomes operative, and there is 
some probability that each such unit will receive an error indication. Since 
we have assumed the activity of these units to be necessary for a solution, 
and have postulated that a solution exists, there must be some assignment 
of S-A values for each such unit which will turn it "on" for $S_i$. Since $S_i$ 
is postulated to reoccur infinitely many times, it follows from 
Theorem 4 of Chapter 5 (treating the A-unit and its input connections as 
equivalent to an R-unit) that the required values will ultimately be obtained. 
Since each A-unit is corrected independently of the others, a state will 
ultimately occur in which all of the A-units which were wrong by condition (a) 
have been corrected. Next consider those A-units for which condition (b) 
applies. For these units Rule 2a of the error correction procedure is 
applicable, and by the same argument as above, the $c^*_{ij}$ will ultimately 
all be corrected. But in that case, we have arrived in a state which permits 
a solution. Since there is nothing in the above argument which depends on 
states prior to the arbitrary starting state, the system can arrive at states 
permitting solutions indefinitely often, and a solution must therefore occur 
with probability 1, provided the probability $p$ of finding a solution while in 
such a state does not approach zero. This last assumption, though plausible, 
still remains to be rigorously proven for the general case. 


For the special case in which the values $v_j$ are bounded, the 
remaining assumption can be proven without difficulty. In the proof of 
Theorem 4, in Chapter 5, it was shown that the number of corrections 
necessary to find a solution is at most equal to 


M(h+ ern)” 
a(éE-d) 


where $M$ and $c$ are constants depending only on the G-matrix (and 
therefore on $A^*$), and $h$ is the length of the vector $Hx^0$. Thus the 
number of corrections required to find a solution can increase only as a 
result of an increase in the magnitude of some components of the starting 
vector, $x^0$, upon successive returns to the same S-A state. But if all 
values $v_j$ are bounded, the components of $x^0$ are also bounded. 
Consequently, $h$ has an upper bound for any given $H$ (or for any given $A^*$). 
This means that there is a maximum number of corrections that might 
possibly be required (assuming that a solution exists) and that the 
probability $p$ of arriving at a solution before destruction of the $A^*$-state is not 
only greater than zero but must be bounded away from zero. Q.E.D. 

13.4 Simulation Experiments 


At the present time, no quantitative theory of the performance 
of systems with variable S-A connections is available. A number of 
simulation experiments have been carried out by Kesler, however, which 
illustrate the performance of such systems in several typical cases, shown 
in the accompanying figures.* In order to show the performance of the 
variable S-A system to its best advantage, small perceptrons were used, for 
which the learning of a horizontal/vertical bar discrimination (Experiment 6) 
falls short of what might be obtained with an optimum S-A organization. 


Figure 38 illustrates the effect of various combinations of the 
probabilities $P_1$, $P_2$, and $P_3$ (including the 0,0,0 case where all S-A 
connections remain fixed, for comparison). The curves show the mean 
performance for 20 perceptrons, with 50 A-units, having 10 input connections 
to each. The initial values of all S-A connections are set equal to +10, and 
the threshold is 50. The same set of 20 networks and training sequences 
was used for each probability combination. 


* The experiments were carried out with the Burroughs 220 computer at 
Cornell University, and the IBM 704 at the A.E.C. Applied Mathematics 
Center at New York University. 

It is found that if the probabilities of changing the S-A 
connections are large, and the threshold is sufficiently small, the system 
becomes unstable, and the rate of learning is hindered rather than helped 
by the variable S-A network. Under such conditions, the S-A connections 
are apt to change into some new configuration while the system is still 
trying to adjust its values to a solution which might be perfectly possible 
with the old configuration. Better performance is obtained if the rate of 
change in the S-A network is sufficiently small to permit an attempt at 
solving the problem before drastic changes occur. To improve the stability 
of the network, in all experiments shown here the A-R connections are 
reinforced, for each stimulus, before determining whether a correction should 
be propagated back to the S-A network. Thus, S-A connections are changed 
only if the system fails to correct an error at the A-R level. 
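
The ordering described above (reinforce the A-R connections first, and touch the S-A network only if the error survives) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the reported experiments: the function names `training_step` and `propagate_back`, the single-increment A-R correction, and the sign convention for responses are all hypothetical choices made for the example.

```python
import numpy as np

def training_step(stimulus, target, sa, ar, threshold, propagate_back):
    """Apply one correction step; return True if the S-A network was touched.

    stimulus : 0/1 vector over sensory points
    target   : desired response, +1 or -1
    sa       : S-A connection matrix, shape (n_a_units, n_s_points)
    ar       : A-R connection values, shape (n_a_units,), modified in place
    propagate_back : hypothetical callback that modifies the S-A connections
    """
    active = (sa @ stimulus >= threshold).astype(float)  # A-unit activity
    response = 1 if ar @ active > 0 else -1
    if response == target:
        return False                       # no error, nothing to correct

    ar += target * active                  # reinforce A-R connections first
    response = 1 if ar @ active > 0 else -1
    if response == target:
        return False                       # error removed at the A-R level

    propagate_back(stimulus, target, sa, active)  # only now change S-A
    return True

# Usage: two A-units, three sensory points, all S-A values equal to 10.
calls = []
def record(stimulus, target, sa, active):
    calls.append(target)                   # stand-in for a real S-A rule

sa = np.full((2, 3), 10.0)
ar = np.zeros(2)

# Threshold 50: no A-unit fires, so A-R reinforcement cannot help.
changed_high = training_step(np.array([1, 0, 0]), +1, sa, ar, 50, record)

# Threshold 20: both A-units fire, and one A-R correction is enough.
changed_low = training_step(np.array([1, 1, 1]), +1, sa, ar, 20, record)
```

In the first call the error cannot be removed at the A-R level, so the callback runs; in the second call the A-R correction suffices and the S-A network is left alone.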


In Figure 39, mean performances of a number of 20 A-unit 
perceptrons are shown, in one case with 4 connections, and in a second 
case with 50 connections to each A-unit. These perceptrons are small enough 
so that in many cases we would expect no solution to exist to the horizontal / 
vertical bar problem (which requires the classification of 40 stimuli with 
only 20 A-units) were it not for the modifiable S-A network. Initial values 
of S-A connections are again equal to 10, and thresholds are2™ , where 
‘ = number of connections to each A-unit. Note that with 50 fixed connections 
to each A-unit the performance is poorer than with only 4 connections, but that 
with $P_1 = .9$, $P_2 = .3$, and $P_3 = .1$, the performance overtakes the 4-connection 
model. This is because with large numbers of S-A connections, the 
perceptron can effectively take its pick of whatever organization might be most 
helpful, and can always reduce excess connections to zero value, while 
with only a small number of connections at its disposal it is seriously limited 
in its potentialities. With only 4 connections, variable S-A connections have 
little effect on performance. 


These experiments suggest that the best performance will 
generally be obtained by taking $P_1 > P_2 > P_3$. 

An interesting application of the variable S-A system is in 
pre-conditioning a perceptron for stimuli of a particular type (such as line 
figures, or blob patterns) by giving it a number of discrimination tasks to 
perform on typical material of the given type, and then trying to teach it a 
new discrimination on the same kind of stimuli. Due to the prior adaptation 
of the S-A system, it is to be expected that the learning curve for the final 
discrimination task should show faster learning after the period of 
pre-conditioning than if the same discrimination task had been attempted with 
the original randomly organized S-A network. In other words, the S-A 
network should become adapted to the stimuli of a particular kind of universe, 
performing better on typical discrimination tasks involving "familiar" kinds 
of stimuli than on tasks involving radically different or "unfamiliar" kinds 
of stimuli. 

14. SUMMARY OF THREE-LAYER SERIES-COUPLED SYSTEMS: 
CAPABILITIES AND DEFICIENCIES 


The three-layer series-coupled perceptron (S→A→R perceptron) 
is the least complicated topological organization which yields fully general 
response-capabilities. The analysis presented in the preceding chapters 
leads, in effect, to the following conclusion: With a suitable design and 
training procedure, a three-layer series-coupled perceptron can be taught 
to duplicate the performance of any finite automaton. This means that if we 
have a finite universe of potential input sequences 
($\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n$) and 
a finite set of possible response sequences 
($\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_m$), then it is 
possible to construct a minimal perceptron such that any response sequence, 
$\mathcal{R}_j$, can be associated with each possible input sequence, $\mathcal{S}_i$. In order 
to do this with full generality, of course, a suitable spectrum of time delays, 
$\tau_{ij}$, must be present, as indicated in Chapter 11. 


Both the generality and the practical limitations of the above 
statement should be emphasized. It is perfectly possible, in principle, to 
teach a minimal perceptron to duplicate the performance of an arbitrary digital 
computer. To do this, every possible sequence of coded instructions and data 
must be represented as a stimulus sequence (one of the $\mathcal{S}_i$) and the set of 
output numbers generated by the computer as a response sequence (one of 
the $\mathcal{R}_j$). If the perceptron is large enough, it can then be trained, with 
an error correction procedure, to make the appropriate association of input 
and output sequences. But what the perceptron learns by this process is to 
simulate the behavior of the digital computer; it does not acquire the 
computer's logic. If any one of the trillions of possible programs were 
omitted from the training sequence, the perceptron would probably fail to 
perform correctly if tested on the omitted sequence. The failure to 
generalize, or to learn logical rules, in such a problem makes such an application 
of these minimal perceptrons totally impractical. 


For practical purposes, we will limit our remarks to the 
performance of these perceptrons in recognizing and reporting environmental 
events. In this connection, the following capabilities have been established: 


(1) A three-layer series-coupled perceptron can be 
taught to associate an arbitrary coded output, or sequence of outputs, $\mathcal{R}_j$, 
to each stimulus, or stimulus sequence, $\mathcal{S}_i$, in a finite environment. 

(2) The perceptron need not be explicitly designed for the 
task which it is required to learn. The same network may be taught a 
variety of alternative outputs, or codifications, of the same environment. 

(3) The required training can be accomplished by means of 
an arbitrary sequence of events from the specified environment, regardless 
of the order or frequency with which they occur, provided each event 
ultimately reoccurs in finite time. 

(4) The training can be accomplished regardless of the 
initial state of the perceptron's memory, and without specifying in detail 
the changes which must take place in the state of the system (i.e., general 
dynamic laws are sufficient to bring about the required adaptation). 

(5) A perceptron will tend to assign the same response to 
any two stimuli or stimulus sequences, $\mathcal{S}_i$ and $\mathcal{S}_j$, which are close to 
identity under temporal translation. By means of discrimination training, 
however, it can be made to associate a different response to each such 
stimulus. 


With this kind of universality in the performance of the system, 
we obviously cannot hope to find any new kinds of response capabilities in 
more complex or sophisticated networks, which cannot be realized by 
minimal perceptrons after suitable training. Nonetheless, the three-layer 
series-coupled perceptron clearly falls far short of biological systems in 
some respects. The differences lie not in what the system can learn to do, 
but rather in the speed, efficiency, economy, and reliability of learning or 
adaptation. An S→A→R perceptron can be taught to play a game, such as 
checkers, only by teaching it what response to make in every conceivable 
situation; a biological system can anticipate most of this training by 
learning the rules of the game. Or, similarly, an S→A→R perceptron can 
distinguish a circle from a triangle in the lower half of its retina only if it 
has previously been trained with triangles and circles in the lower half of 
its retina; it will not generalize from experience with similar forms in the 
upper half of the field. In Nature, the enormous number of sensory situations 
which comprise the potential universe (each situation, individually, having 
exceedingly low probability of occurrence) makes the capabilities of 
generalization, analysis, and abstraction absolutely essential for an 
advanced organism, or recognition device, to function properly. Two main 
ingredients of such performance are recognition of similarity and 
recognition of functional parts, or entities. The first of these is basic to 
generalization and induction, while the second is basic to analysis, the abstraction 
of relations, and the reduction of complex situations to familiar terms. 
Seen in this light, the principal deficiencies of these minimal-topology 
perceptrons are: 

(1) An excessively large system may be required. 

(2) The learning time may be excessive. 

(3) The system may be excessively dependent on external 
evaluation (by an independent r.c.s.) during learning. 

(4) The generalizing ability (inductive ability) is insufficient. 

(5) Ability to separate essential parts in a complex 
sensory field (analytic ability) is insufficient. 


Point (1) is largely attributable to (5); the excessive size of 
the perceptrons necessary to deal with complex environmental situations 
is due largely to the necessity of having a characteristic set of A-units 
representing every possible sensory field or sequence in its entirety. A 
preliminary coding of the field in terms of its parts and relations would 
greatly reduce the size of the system required to describe a given universe 
of situations. To take an extreme case, if a three-layer series-coupled 
perceptron is required to produce as an output the coded representation of 
the sum of a sequence of a million digits, it must be capable of representing 
in its association system every possible sequence of a million digits 
(presented either serially or simultaneously): $10^{10^6}$ possibilities in all. 
On the other hand, a perceptron which could attend selectively to each 
digit, form a partial sum, and then go on to the next digit, requires only $10^8$ 
possible states: $10^7$ to represent the possible values of the partial sum, 
multiplied by a factor of ten to allow for each of the possible incoming digits. 
The second method is the one employed by a digital computer, or a man 
adding a sequence of numbers. In the field of sensory pattern recognition, 
similar conditions occur. The recognition of a sentence is made much 
easier by breaking it into words, and the recognition of a scene is made 
easier by analyzing it into objects and relations. 
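
The two state counts in the digit-summing example can be checked with a few lines of arithmetic. The sketch below is only an illustration of the counting argument; the variable names are introduced here, and the whole-sequence count is reported as a base-10 logarithm because the number itself is far too large to print.

```python
import math

n_digits = 10**6          # length of the digit sequence
digit_values = 10         # each digit takes one of the values 0-9

# Whole-sequence representation: one state per possible sequence,
# i.e. digit_values ** n_digits states. Report log10 of that count.
log10_whole_sequence_states = n_digits * math.log10(digit_values)

# Sequential representation: (partial sum so far) x (incoming digit).
max_partial_sum = 9 * n_digits              # every digit equal to 9
partial_sum_values = max_partial_sum + 1    # sums 0 .. 9,000,000
sequential_states = partial_sum_values * digit_values

print(f"whole-sequence states: 10^{log10_whole_sequence_states:.0f}")
print(f"partial-sum values   : {partial_sum_values:,} (under 10^7)")
print(f"sequential states    : {sequential_states:,} (under 10^8)")
```

The exact sequential count is slightly below the rounded figures of $10^7$ and $10^8$ quoted in the text, which are upper bounds of convenient size.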


Similarly, the excessive learning time (point 2) can be largely 
attributed to (4), the insufficient generalizing ability of the system. With 
improved generalization, several examples should be sufficient to teach 
the perceptron to recognize all members of a class of similar events, 
whereas at present an unduly large sample is required in order to extend 
the response over the class. The insufficient generalizing capability has 
been frequently pointed out in the preceding chapters, and is common to 
all of the S→A→R perceptrons. Thus points (3), (4) and (5) appear to be 
the primary deficiencies. 


In connection with point (3), we note the failure of minimal 
perceptrons to reach "useful" terminal states under R-controlled 
reinforcement procedures, except under exceptional environmental and 
organizational conditions. This means that the reinforcement control 
system must itself have a great deal of information about the environment, 
and must generally know, or have built into it, the precise discrimination 
or response functions which the perceptron is supposed to learn. Thus the 
r.c.s. must either be a free agent (e.g., a human trainer) or else some 
kind of homunculus within the same physical system as the perceptron. It 
has been noted that a perceptron can improve over the performance of the 
r.c.s. in some cases (Section 8.1.4) but the functioning of the r.c.s. still 
seems to be rather remote from what might be expected of a biological 
motivating system. By using a random-sign correction procedure, the 
information required from the r.c.s. is minimized; with such a procedure, 
the possible outputs of the r.c.s. can be interpreted to mean "hold steady" 
or "change", while with a directed correction procedure the three 
alternatives "hold steady", "increase values", or "decrease values" are all 
required. But the efficiency of a system employing the randomized 
procedure is greatly reduced (cf. Figure 19) and the only hope for such 
systems seems to be in a "majority rule" procedure, which increases the 
size and complexity of the total organization. 


If a system could be contrived which would guarantee 
generalization of a response from one stimulus of a class to all other 
stimuli of that class, an r.c.s. which employs the "trial-and-error" 
process of the random-sign procedure might become practical, and a 
simple motivation system which senses only the suitability or unsuitability 
of the present response or state of the organism might be substituted for 
the more complicated r.c.s. assumed for most of the preceding 
experiments. In Part III, it will be shown that multi-layer and cross-coupled 
perceptrons are capable of providing just this sort of generalizing capability, 
and, moreover, that this capability may be "self-organizing" under 
reasonable environmental conditions. That is to say, R-controlled systems 
can learn to form reasonable classes on the basis of a similarity criterion, 
provided there is some support for this organization from the environment. 
The required support takes the form of a "continuity constraint", which says, 
in effect, that stimuli do not occur as momentary flashes, but are more 
likely to persist for a time, during which they undergo a series of 
movements or transformations. It will be seen that such a sequential organization 
provides sufficient information to enable a multi-layer or cross-coupled 
perceptron to abstract a concept of similarity which can then be employed 
to obtain immediate generalization in later situations. 


The improvements which have been demonstrated to date in 
multi-layer and cross-coupled perceptrons will be seen to be primarily 
in the field of generalization phenomena, and their main virtue is in 
reducing the learning time of a perceptron. Some reductions in size 
requirements have also been demonstrated, and the dependence on 
external evaluation of performance is largely eliminated. Thus points (1) 
through (4), in the list of criticisms of minimal perceptrons, can be largely 
or entirely eliminated with a multi-layer or cross-coupled topology. 
Point (5), however, remains the least understood of the current problems. 
While there is some indication that perceptrons of the types to be 
considered in Part III may have some analyzing ability (for example, they can 
isolate contours from solid figures, and may possibly learn to suppress 
the partial response of the association system to irrelevant aspects of the 
stimulus field) it is not yet possible to say whether such systems are really 
sufficient to meet the challenge of point (5), or not. The psychological 
problems of figure-ground organization, recognition of relations, and 
"cognitive set" are all involved here. It is likely that "back-coupled 
perceptrons", in which R-units or deep association layers feed back to 
more superficial layers, may be necessary to deal with these problems. 
Several possible approaches will be considered in Part IV, which deals 
with current problems, and attempts to establish directions for future 
study. 

15. MULTI-LAYER PERCEPTRONS WITH FIXED PRETERMINAL 
NETWORKS 


The perceptrons considered in Part II have all consisted 
of three "layers" of signal generating elements: a sensory layer, a single 
layer of association units, and a layer of R-units (containing only a single 
unit in the case of simple perceptrons). A perceptron with additional layers 
of A-units between S and R-units will be called a multi-layer system. Thus 
the network diagram: 

[Diagram: four-layer series-coupled network] 

represents a four-layer series-coupled system, whereas the diagram 

[Diagram: three-layer cross-coupled network, with connections among the A-units] 

represents a three-layer cross-coupled system, since all A-units are at 
the same logical distance from the sensory units (see Definition 18, 
Chapter 4). The three-layer structure of the second diagram can be made 
clearer if it is drawn in the form: 

[Diagram: the same cross-coupled network, redrawn] 

which is topologically identical to the preceding network. Cross-coupled 
systems will be considered in detail in the following chapters. 


It has been demonstrated that three-layer, series-coupled 
perceptrons are capable of learning any type of classification, or 
associating any responses to stimuli or to sequences of stimuli, that 
might possibly be required. Therefore, if a multi-layer topology is to 
offer any functional advantages, it will not be in the form of new kinds of 
responses to stimuli (since any such response can be achieved with a 
three-layer system) but rather in increased efficiency in the acquisition 
of such responses. It can, in fact, be demonstrated that the adaptability, 
or ease of acquisition of responses, may be greatly improved with a 
suitable multi-layer topology. The most striking improvements are to 
be found in the generalizing ability of such networks -- an ability to give 
appropriate responses to stimuli for which they have not been taught. It 
has been seen that this "inductive" or generalizing capability is present 
only in rudimentary form in three-layer series-coupled systems. Some 
multi-layer systems also show improvements in sensitivity to differences 
between highly similar stimuli, making such discriminations easier to 
learn, as will be seen in Section 15.1. 

In the following sections, we will first consider systems in 
which all connections other than connections to R-units have fixed values, 
only the R-unit input connections being reinforced. The connections to the 
R-units will be called terminal connections, all other connections (from S 
to A-units, and A-units to other A-units) being called preterminal connections. 
It will be seen in Section 15.2 that the most interesting effects which can be 
obtained by such systems depend on special constraints in the organization of 
the preterminal network. The following chapter will therefore be devoted to 
the examination of dynamic rules by which the preterminal connections 
between layers of A-units can be modified, so as to yield the required 
organizations as a result of the system's adaptive functioning, in a suitably organized 
environment. 


The analysis of multi-layer systems is of interest not only in its 
own right, but also because it introduces many of the problems and formal 
techniques of analysis which will be encountered in the following chapters on 
cross-coupled systems, with feed-back loops within the network. In fact, it 
is found that with a suitable transformation, many "closed-loop" 
cross-coupled systems can be represented by an equivalent "open-loop" 
multi-layer system. 

15.1 Multi-layer Binomial and Poisson Models 


The most straightforward extension of our previous models to 
a multi-layer topology is to assume that each A-unit in the first association 
layer is assigned an origin point configuration in the retina, or sensory 
layer, chosen independently for each A-unit, as before. Each A-unit in the 
second layer (designated $A^{(2)}$) is similarly assigned an origin point 
configuration in the $A^{(1)}$ layer, independently for each such A-unit. In general, 
every A-unit in the $A^{(k)}$ layer is independently assigned an origin point 
configuration from an appropriate distribution (binomial or Poisson model), 
the connections originating from the $A^{(k-1)}$ layer. All connections from one 
A-layer to the next are assumed to be fixed in value, the final A-layer sending 
variable-valued connections to the R-units. In order to analyze the 
performance of such a perceptron, it is sufficient to determine the Q-functions for 
the A-units of the last layer, before the R-unit, since, given these Q-functions, 
we can then apply the same equations and analysis which were employed 
in Part II, for three-layer perceptrons. The notation $Q^{(1)}$ will be 
used to denote the Q-functions for A-units in the first layer (which are 
identical with the Q-functions discussed in Chapter 6), and $Q^{(n)}$ to 
denote Q-functions for units in the $n$th layer. 


Even in the simplest case, of a four-layer perceptron, the 
combinatorial analysis required for a rigorous statement of the $Q^{(2)}$ functions 
is awe-inspiring. A special case, in which all inter-layer connections are 
inhibitory, and the thresholds of all $A^{(2)}$ units are zero, has been 
analyzed by Joseph (Ref. 41), and the reader is referred to his 
contribution for the detailed considerations. The basic difficulty stems from 
the fact that a second layer Q-function, such as $Q^{(2)}_{ij}$, depends on the 
distribution of the numbers of A-units in the first layer which respond 
to $S_i$ alone, $S_j$ alone, and jointly to $S_i$ and $S_j$. The expected 
values of these numbers are obtainable from the $Q^{(1)}$ functions in a 
straightforward manner, but the non-central moments of the distributions 
enter into the analysis in such a way that it becomes unduly complicated. 


A practical solution is obtained by assuming that the numbers 
of A-units in the 1st, 2nd, ..., $n$th layers (designated by 
$N_a^{(1)}, N_a^{(2)}, \ldots, N_a^{(n)}$) are all very large, or infinite. In this case, 
the proportion of active units in each layer in response to $S_i$ will be 
equal to $Q_i$, and the expected values of all set-intersections can be 
employed in the analysis. In this case, the equations of Chapter 6 can 
be employed without modification to compute $Q^{(n)}_i$ by using $Q^{(n-1)}_i$ 
in place of the stimulus area, $R_i$; $Q^{(n-1)}_{ij}$ in place of the intersection 
$C$, etc. The error introduced by assuming infinite $N_a$ for the 
preterminal layers will be slight, as long as the actual $N_a$ is reasonably large. 


The addition of extra A-unit layers can have one of several 
interesting effects, depending upon the parameters $x$, $y$, and $\theta$ 
(or $\bar{x}$, $\bar{y}$, and $\theta$ in a Poisson model) for each layer. The special 
case of inhibitory connections and zero thresholds was investigated by 
Joseph (Ref. 41), who finds that by optimizing the number of input 
connections to each layer, so as to achieve highest probability of correct 
recognition, $Q_i$ approaches a constant as the number of layers increases, 
regardless of the size of the stimuli or the dichotomy which the perceptron 
is required to learn. At the same time, $Q_{ij}$ approaches $Q_i^2$, $Q_{ijk}$ 
approaches $Q_i^3$, etc. In effect, this represents a condition in which, in 
the terminal association layer, a statistically independent set of A-units 
responds to each stimulus in the environment. The consequence is that 
all discriminations become equally easy. Specifically, it was found that 
the ratio (454 7 ; ) for 100 A-units in the terminal layer approaches 
1.941 as the number of layers is increased, with an environment of 40 
stimuli. A comparison with Table 3, in Chapter 7, shows that this 
performance is less than would be achieved with a three-layer perceptron 
for the task of discriminating horizontal from vertical bars, but it is 
considerably better than the performance of a three-layer perceptron 
on a more difficult task, such as the odd-even bar discrimination illustrated 
in Table 4. Thus the addition of extra association layers can be used to 
improve discrimination in difficult problems, but only at the cost of reduced 
generalizing ability, since two adjacent stimuli with a large intersection are 
now no more closely related (in the $A^{(\ell)}$ layer) than two totally disjoint 
stimuli. 


In Joseph's model, with all inhibitory connections, the above 
results are obtained only by optimizing the number of connections to each 
new layer of A-units. If, instead of carrying out this optimization, a fixed 
number of connections is assumed for all A-units in the system, the 
perceptron will be unstable, and will tend to develop oscillations such that 
alternate A-layers are totally "on" or totally "off", making all 
discrimination impossible. Moreover, it is to be expected that a model which has 
been optimized for one environment, with a given size of stimuli, will be 
unstable in a different environment, with a slightly different size of stimuli. 
In more practical cases, a mixture of excitatory and inhibitory connections 
must be used, with thresholds greater than zero, in order to guarantee 
stability and convergence of $Q_i$ for a range of environmental variations. 
Clearly, if $x < y + \theta$, $Q_i^{(\ell)}$ will not go to 1 as $\ell$ increases. If $x = y$, 
a suitable choice of $\theta > 0$ will generally guarantee, as well, that $Q_i$ 
will not go to zero.* From Figure 7(b), for example, it is clear that if 
$x = y = 5$, and $\theta = 1$, an equilibrium should occur at about $Q_i = .37$, 
since at this point $Q_i^{(\ell)} = Q_i^{(\ell-1)}$. If $Q_i^{(\ell-1)}$ should rise above .37, 
we will have $Q_i^{(\ell)} < Q_i^{(\ell-1)}$, while if $Q_i^{(\ell-1)}$ falls below .37 we 
will have $Q_i^{(\ell)} > Q_i^{(\ell-1)}$. If we increase the amount of inhibition by 
making $x = 3$, $y = 7$, then (from the same Figure) we find that the 
equilibrium value of $Q_i$ is reduced to .14. If the inhibition is increased 
still further (e.g., to $x = 1$, $y = 9$, as in the bottom curve of Fig. 7b) 
the equilibrium value of $Q_i$ is zero, and no matter how large a stimulus 
is presented, activity will die away entirely in the "deeper" association 
layers. 
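
The equilibrium argument can be explored numerically by iterating a layer-to-layer activity map. The sketch below is an illustration under assumptions that are not spelled out in this passage: it takes the infinite-layer binomial model, supposes that each of a unit's $x$ excitatory and $y$ inhibitory connections (all of unit weight) originates from an active unit independently with probability equal to the previous layer's activity, and supposes that a unit fires when its net input is at least $\theta$. The function names are hypothetical.

```python
from math import comb

def binom_pmf(n, k, q):
    """Probability of exactly k successes in n trials of probability q."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

def next_layer_activity(q, x, y, theta):
    """Fraction of units active in the next layer, given fraction q in this one.

    A unit fires when (active excitatory inputs) - (active inhibitory inputs)
    is at least theta; this firing rule is an assumption of the sketch.
    """
    total = 0.0
    for e in range(x + 1):
        for i in range(y + 1):
            if e - i >= theta:
                total += binom_pmf(x, e, q) * binom_pmf(y, i, q)
    return total

def equilibrium(x, y, theta, q0=0.5, n_layers=200):
    """Iterate the layer-to-layer map from q0 and return the final activity."""
    q = q0
    for _ in range(n_layers):
        q = next_layer_activity(q, x, y, theta)
    return q

# The three parameter settings discussed in the text, all with theta = 1.
results = {(x, y): equilibrium(x, y, theta=1) for x, y in [(5, 5), (3, 7), (1, 9)]}
for (x, y), q in results.items():
    print(f"x={x}, y={y}: activity after 200 layers = {q:.3f}")
```

Under these assumptions the first two settings should settle close to the equilibrium values read from Figure 7(b), and the third should decay toward zero. The starting activity must lie strictly between 0 and 1, since with $\theta = 1$ an activity of exactly 0 or 1 maps to 0 in this model.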


* This observation will generally not be valid for a small perceptron, 
where the actual level of activity may go to zero in one of the layers, 
due to random variations in the network. In this case, $Q_i$ will be 
zero for all subsequent layers. Thus, for a finite system, $Q_i^{(\ell)} \to 0$ 
as $\ell \to \infty$. 

15.2 The Concept of Similarity-Generalization 


So far, the addition of extra association layers has had no 
important effect beyond the sharpening of the discriminative acuity of the 
perceptron, generally counterbalanced by a loss in the generalizing capa- 
bility of the system. In the next section, we will consider a four-layer 
perceptron with special constraints in the organization of the connections 
to the A-units, such that the system tends, spontaneously, to generalize 
a response associated to a given stimulus pattern to all "similar" stimuli, 
regardless of their location in the retinal field. In the following chapter, it 
will be shown that such constraints need not be built into the system ab initio, 
but can arise through a spontaneous adaptation process (without any inter- 
vention by the r.c.s.) if some simple dynamic laws are introduced. In all 
of these systems, the concept of "similarity" is of fundamental importance. 


The term "similarity" has been used in a number of different 
ways, some of them well-defined, as in "two triangles are similar", some 
relatively vague and ambiguous, as in ''two faces are similar" or "two ideas 
are similar". For present purposes, we have need of a concept which will 
cover the range of relationships which might make two objects appear 
"similar" to a perceiving observer, but which will still permit exact 
definition for purposes of analysis. We must also distinguish between 
the "objective similarity'' of objects in space, the similarity of stimuli 
on the retina, and the "subjective similarity'' which the observer recognizes 
and reports. While the concepts proposed here do not cover all of the
possible meanings of ''similarity" in psychology, they are sufficient to 
permit the design of a number of perceptual experiments related to the 


similarity problem. 
15.2.1 Similarity Classes


We will first consider a definition of similarity which is 
applicable to the classification of stimuli. From this point of view, two 
stimuli either are similar or they are not; there are no intermediate degrees 
of similarity. In the following section, a quantitative definition which per- 
mits a multidimensional ordering of objects or stimuli according to their 


similarity will be considered. 


For present purposes, the only constraints which will be placed 
on the logical nature of the similarity relation are that it should be 
symmetric and reflexive; that is, if A sim B, then B sim A, and A is
always similar to itself. It is not required that the relation of similarity 
should be transitive; that is, A sim B and B sim C does not imply A sim C, 
except under very special conditions, as will be seen below. There are 
clearly a large number of possible relations which meet the logical conditions 
for a similarity relation. For example, equality, geometrical congruence, 
equality of area, and topological equivalence are all admissible possibilities. 
Thus, in specifying the similarity of two stimuli the notation A sim B | K
will be used, where K is a particular relation, meeting the conditions
of symmetry and reflexivity.


The set of stimuli which are similar under a given relation
will be said to form a similarity class under that relation. For example,
if K is defined as the relation of similarity under a rotation group, then
A sim B | K means that A is a rotated image of B, and B is a rotated
image of A.
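As a concrete illustration of a similarity class, the following sketch
uses the group of translations on a small toroidal grid in place of the
rotation group mentioned above. The function names (`translate`,
`similarity_class`, `sim`) and the representation of a stimulus as a set
of (x, y) points are assumptions made for this sketch; they are not
notation from the text.

```python
from itertools import product


def translate(points, dx, dy, width, height):
    """Shift a set of (x, y) points by (dx, dy) with wrap-around."""
    return frozenset(((x + dx) % width, (y + dy) % height) for x, y in points)


def similarity_class(points, width, height):
    """All distinct translates of a stimulus on a width x height torus."""
    return {
        translate(points, dx, dy, width, height)
        for dx, dy in product(range(width), range(height))
    }


def sim(a, b, width, height):
    """True if b is a translate of a, i.e. 'a sim b' under translation."""
    return frozenset(b) in similarity_class(a, width, height)


bar = {(0, 0), (1, 0)}          # horizontal two-point bar
shifted_bar = {(2, 3), (3, 3)}  # the same bar, moved
column = {(0, 0), (0, 1)}       # vertical bar: not a translate of `bar`

print(sim(bar, shifted_bar, 4, 4), sim(shifted_bar, bar, 4, 4))
print(sim(bar, column, 4, 4))
```

For this particular relation the class membership test is symmetric and
reflexive, as the definition requires; because translations form a group
acting on stimuli in one-to-one fashion, it also happens to be transitive.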
In perceptual problems, a particular kind of similarity class
is of particular importance. This will be called a projective similarity
class, and is defined as follows. Let the sensory points of a perceptron
be embedded in an $r$-dimensional sensory manifold, $\mathcal{S}$. Let
$\mathcal{S}$ be embedded in an $(r+q)$-dimensional world manifold,
$\mathcal{M}$. An object in $\mathcal{M}$ is defined as any set of points
in $\mathcal{M}$.* Let $\Omega$ be a set of admissible objects in
$\mathcal{M}$. Let $\mathcal{G}$ be any transformation group in
$\mathcal{M}$. Let a projection $\pi$ be defined as an operation which maps
every point in $\mathcal{M}$ into at most one point in $\mathcal{S}$. Then
A sim B | $\mathcal{G}$, $\mathcal{S}$, $\Omega$, $\pi$ means that stimuli
A and B are both $\pi$-projections onto the sensory points in
$\mathcal{S}$ of transforms under $\mathcal{G}$ of the same object in
$\Omega$.


A few moments' reflection should show that this encompasses
most of the cases in which we say that two stimuli are perceptually
"equivalent"; for example, any group of rigid movements of an object in
3-space will yield a projective similarity class on a two-dimensional
retina. Note that this similarity relation is not generally transitive. For
example, if we let $\mathcal{G}$ be the group of rigid motions in 3-space,
and let $r = 2$, then the similarity classes generated by a flat cut-out of
a square in $\mathcal{M}$, and by a cube in $\mathcal{M}$ (with orthogonal
projection onto the retina) are related by the Venn diagram:


[Venn diagram: two overlapping classes, SQUARE PROJECTIONS and CUBE
PROJECTIONS]


* The term "object" is used in much the same sense as "distal stimulus"
in psychology. Our use of the term "stimulus" always signifies a
"proximal stimulus" unless otherwise specified.
where the intersection includes all cases where the square and a face of
the cube are both parallel to the retinal surface (assuming $\mathcal{S}$
to be Euclidean, which it is not in a vertebrate eye). A tilted square will
be projected as a parallelogram, whereas a tilted cube is projected either
as a rectangle, pentagon, or hexagon, so that the classes, although they
intersect, are not equivalent.


For the special case in which the points of an object and all of
its transforms in $\mathcal{M}$ can be placed in one-to-one correspondence
with the S-points in $\mathcal{S}$, the relation of projective similarity
will be transitive. This includes the case in which $\mathcal{M}$ and
$\mathcal{S}$ are of the same dimensionality and coextensive, objects and
transforms consisting only of sensory points in $\mathcal{M}$. Most
stimulus classes considered in experiments up to this point have been
interpretable in this fashion. Alternatively, $\mathcal{M}$ might have a
higher dimensionality than $\mathcal{S}$, but the group $\mathcal{G}$ may
be limited to motions parallel to the surface of $\mathcal{S}$. Here again,
with a suitable choice of $\mathcal{G}$, a transitive similarity relation
can be obtained.


The case of greatest psychological interest is that of a three-
dimensional world-manifold, $\mathcal{M}$, and a two-dimensional sensory
manifold, $\mathcal{S}$, where $\mathcal{G}$ is the group of rigid motions
and dilatations in $\mathcal{M}$. A perceptron which generalizes strongly
between any two members of a similarity class defined by such a relation,
and generalizes weakly between stimuli which are not in the same similarity
class, will duplicate a large fraction of the perceptual behavior of a
biological organism, in the visual domain.*


* A consideration of some of the projection operations which apply to 
this problem can be found in Gibson, Olum, and Rosenblatt, Ref. 27. 
15.2.2 Measurement of Similarity, Objective and Subjective


Let $\mathcal{G}$ be a Lie group* (of dimension $r$) of transformations
of the manifold $\mathcal{M}$. Let $B$ be a canonical system of coordinates
defined in the Euclidean $r$-space, $E^r$, such that every system of
equations $g_i(t) = a_i t$ (where $g_i$ is the $i$-th coordinate of $g$ in
$B$) gives a one-parameter subgroup $g(t)$. Then the distance $d(0, g)$ for
any $g \in \mathcal{G}$ ($g = (g_1, \ldots, g_r)$) is given by

$$d(0, g) = \sqrt{\sum_i g_i^2}$$


We then define the similarity measure $\mu(X, Y) \mid \mathcal{G}, B$ for
the objects $X$ and $Y$ with respect to $\mathcal{G}$ and $B$ as

$$\mu(X, Y) \mid \mathcal{G}, B = \min_{g \in \Gamma} d(0, g) \tag{15.1}$$

where $\Gamma = \{g : X = gY\}$, $g \in \mathcal{G}$. (That is, $\Gamma$ is
the set of all transformations in $\mathcal{G}$ which will transform the
object $Y$ into the object $X$.)


Note that this measure is applicable only to objects in $\mathcal{M}$
which are similar under $\mathcal{G}$; it is not applicable to stimuli
unless $\mathcal{S}$ is coextensive with $\mathcal{M}$. Consequently, the
measure $\mu$ will be called the objective similarity measure with respect
to $\mathcal{G}$ and $B$. This measure represents the length of a sort of
"shortest path" by which $Y$ can be continuously transformed into $X$, by
means of transformations of the group $\mathcal{G}$. The choice of the
basis, $B$, determines the relative weighting attached to various subgroups
of $\mathcal{G}$. For example, if $\mathcal{G}$ is the group of
translations in $\mathcal{M}$, then $\mu$ can be made proportional to the
length of the displacement vector which would carry $Y$ into $X$.

* Readers who are unfamiliar with the theory of Lie-groups will find a
useful discussion of this subject in Pontrjagin (Ref. 111).
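For the translation example just mentioned, the objective similarity
measure can be sketched on a discrete toroidal grid. This is only an
illustration under stated assumptions: the coordinates of a translation
are taken to be its horizontal and vertical displacements in grid units,
each reduced to its shortest wrap-around representative, and the function
name `objective_similarity` is hypothetical.

```python
import math
from itertools import product


def objective_similarity(x, y, width, height):
    """Smallest d(0, g) over translations g with X = gY; None if no such g."""
    x = frozenset(x)
    y = frozenset(y)
    best = None
    for dx, dy in product(range(width), range(height)):
        moved = frozenset(((px + dx) % width, (py + dy) % height) for px, py in y)
        if moved != x:
            continue
        # Shortest representative of the displacement on the torus.
        step_x = min(dx, width - dx)
        step_y = min(dy, height - dy)
        distance = math.hypot(step_x, step_y)
        if best is None or distance < best:
            best = distance
    return best


y_obj = {(0, 0), (1, 0)}
print(objective_similarity({(3, 0), (4, 0)}, y_obj, 8, 8))  # shift of 3 units
print(objective_similarity({(0, 0), (0, 1)}, y_obj, 8, 8))  # not similar
```

Returning `None` for a pair outside the same similarity class mirrors the
remark that the measure applies only to objects which are similar under
the group.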


Let us also define the subjective similarity measure with
respect to a perceptron, $P$, a response unit, $R$, and a projection
operator $\pi$, by

$$\mu^*(X, Y) \mid \pi, P, R = \frac{Q_{XY}(R)}{\sqrt{Q_X(R)\, Q_Y(R)}} \le 1 \tag{15.2}$$

where $Q_{XY}(R)$ is the value of $Q_{ij}$ for the stimuli corresponding to
the objects $X$ and $Y$ (under the projection $\pi$) measured in the source
set of the response unit $R$. For an $\alpha$-system, and stimuli of fixed
size, $\mu^*(X, Y)$ is proportional to the generalization coefficient
$g_{XY}$ for the response $R$. For two identical stimuli,
$\mu^*(X, Y) = 1$. If the value of $\mu^*(X, Y)$ is a monotonic function of
the objective similarity of the objects $X$ and $Y$, we would expect the
response $R$ to generalize most strongly to highly "similar" objects, and
most weakly to dissimilar objects. Over any given subgroup of
transformations of an object in $\mathcal{M}$, this induces a
"generalization gradient" equivalent to the use of the term in
experimental psychology.


A perceptron which is to simulate perceptual performance
must have or acquire a close correlation between the subjective and
objective similarities of objects in physical space, under the group of
rigid motions and some kinds of continuous deformation. A perceptron in
which such a correlation exists is said to be capable of similarity
generalization. Similarity generalization implies that the perceptron not
only tends to generalize to similar objects, but retains its ability to
respond differentially to dissimilar objects. The demonstration of such a
capability will be our main concern for the remainder of this chapter and
the following four chapters.


15.3 Four-Layer Systems with Intrinsic Similarity Generalization 


15.3.1 Perceptron Organization 


The four-layer perceptrons to be analyzed have fixed connections 
except for the terminal A to R-unit connections, and a topology which is 
illustrated in Figure 40. S, A, and R-units are all assumed to be of the 
simple variety, resembling those of an elementary perceptron. The 
special features of this system (which might be called a ''similarity- 
constrained perceptron'') are the following: 

(1) Each $A^{(1)}$ unit has a threshold $\theta$, $x$ excitatory and
$y$ inhibitory input connections, and a single output connection to one of
the $A^{(2)}$ units.

(2) Each $A^{(2)}$ unit receives connections from a source
set of $m$ $A^{(1)}$ units, and has a threshold equal to 1.
[Diagram: toroidal retina with excitatory and inhibitory origin points
feeding the $A^{(1)}$ units; fixed connections of value +1; variable-valued
connections.]

Figure 40 ORGANIZATION OF A SIMILARITY-CONSTRAINED PERCEPTRON (x = 2, y =,
m = 3). $\mathcal{G}$ = TRANSLATION GROUP IN TOROIDAL RETINA.


(3) The values of all connections from $A^{(1)}$ to $A^{(2)}$
units are equal to +1.

(4) All $A^{(1)}$ units in the source set of a given $A^{(2)}$
unit have origin point configurations which are members of a similarity
class, under some similarity relation K.


The subsequent discussion will be limited to the special case
in which the similarity relation K is equivalent to similarity under a
transformation group, $\mathcal{G}$, in the sensory space of the
perceptron.* This means that, when an origin configuration has been picked
for one of the $A^{(1)}$ units connected to a given $A^{(2)}$ unit, the
remaining $m - 1$ $A^{(1)}$ units connected to the same $A^{(2)}$ unit must
have origin configurations which are transforms under $\mathcal{G}$ of the
first configuration selected. This is illustrated in Fig. 40 for a case in
which $m = 3$, and the transformation group is the group of horizontal and
vertical translations on the retina. In the model to be analyzed, it is
assumed that a single template configuration is chosen at random for each
$A^{(2)}$ unit, and the $m$ origin configurations actually assigned to the
$A^{(1)}$ units are obtained by selecting $m$ transformations at random,
without replacement, from the group $\mathcal{G}$. This yields the
auxiliary condition that no two $A^{(1)}$ units in the same source set have
identical origin point configurations.
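The construction of a source set can be sketched as follows for the
translation group on a toroidal grid. The function name
`build_source_set`, the (x, y, sign) encoding of origin points, and the
particular template are illustrative assumptions, not part of the model's
specification.

```python
import random


def build_source_set(template, m, width, height, rng):
    """Return m origin configurations, each a distinct translate of `template`.

    `template` is a list of (x, y, sign) origin points, with sign +1 for an
    excitatory origin and -1 for an inhibitory one.
    """
    all_shifts = [(dx, dy) for dx in range(width) for dy in range(height)]
    chosen = rng.sample(all_shifts, m)  # sampling without replacement
    return [
        frozenset(((x + dx) % width, (y + dy) % height, sign)
                  for x, y, sign in template)
        for dx, dy in chosen
    ]


# Illustrative template: two excitatory origins and one inhibitory origin.
template = [(0, 0, +1), (1, 0, +1), (0, 2, -1)]
source_set = build_source_set(template, m=3, width=6, height=6,
                              rng=random.Random(0))
for configuration in source_set:
    print(sorted(configuration))
```

Because the translations are drawn without replacement and this template
has no translational symmetry, the three configurations are guaranteed to
differ, which is the auxiliary condition stated above.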


* In the case considered here, the world manifold $\mathcal{M}$ and the
sensory space $\mathcal{S}$ are taken to be coextensive, with a one-to-one
correspondence between objects in $\mathcal{M}$ and stimuli in
$\mathcal{S}$.
15.3.2 Analysis


To begin with, we will attempt to provide an intuitive basis
for understanding the functioning of the similarity-constrained perceptron.
At one extreme, if $m = 1$, note that the system becomes functionally
equivalent to an elementary perceptron of the binomial variety, with A-units
having the same parameters as the $A^{(1)}$ units in the 4-layer model. At
the other extreme, where $m$ is equal to the order of the transformation
group, there is one $A^{(1)}$ unit in each source set for every possible
transform of the "template configuration". Now if one of the $A^{(1)}$
units whose origin configuration is $w$ responds to a stimulus $S_x$, any
transform $T(S_x)$ will necessarily activate the $A^{(1)}$ unit whose
origin configuration is the transform $T(w)$. Since both of these $A^{(1)}$
units are connected to the same $A^{(2)}$ unit, this unit will respond both
to $S_x$ and $T(S_x)$, since its threshold is 1, and the values of the
connections from $A^{(1)}$ to $A^{(2)}$ units are fixed at 1. Thus we have
the rule that any $A^{(2)}$ unit which responds to a stimulus $S_x$ will
also respond to all transforms $T(S_x)$ under the group $\mathcal{G}$.
Alternatively, we could state that if $S_x$ sim $S_y$ | $\mathcal{G}$ and
an $A^{(2)}$ unit $a_i$ responds to $S_x$, then this unit will also respond
to $S_y$. Next suppose that in addition to making $m$ equal to the order of
the group, the threshold of the $A^{(1)}$ units is $\theta$ = number of
excitatory origins = area of the stimuli, and the number of inhibitory
origins is equal to the complementary area, so that an $A^{(1)}$ unit will
respond to only one stimulus. We then have an ideal situation, in which an
$A^{(2)}$ unit responds to all the members of a given similarity class, and
only to members of that similarity class. Under these conditions, if we
show the perceptron a stimulus, say a square, and associate a response to
that square, this response will immediately generalize perfectly to all
transforms of the square under the group $\mathcal{G}$, and will not
generalize at all to any stimulus which is not a transform of the square
under $\mathcal{G}$.
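The rule that an $A^{(2)}$ unit responds uniformly over a whole similarity
class when $m$ equals the order of the group can be checked by brute force
on a small toroidal retina. The sketch below is a toy model under stated
assumptions: an $A^{(1)}$ unit fires when excitatory hits minus inhibitory
hits reach its threshold, and the grid size, template, and function names
are all chosen here for illustration.

```python
from itertools import product


def shift(points, dx, dy, n):
    """Translate a set of (x, y) points on an n x n torus."""
    return frozenset(((x + dx) % n, (y + dy) % n) for x, y in points)


def a1_responds(excitatory, inhibitory, stimulus, theta):
    """A(1) unit: excitatory hits minus inhibitory hits must reach theta."""
    signal = len(excitatory & stimulus) - len(inhibitory & stimulus)
    return signal >= theta


def a2_responds(source_set, stimulus, theta):
    """A(2) unit with threshold 1 and +1 inputs: fires if any A(1) unit fires."""
    return any(a1_responds(e, i, stimulus, theta) for e, i in source_set)


n = 5
theta = 2
template_exc = frozenset({(0, 0), (1, 0)})
template_inh = frozenset({(0, 1)})

# m equal to the order of the group: one A(1) unit per translation.
full_source_set = [
    (shift(template_exc, dx, dy, n), shift(template_inh, dx, dy, n))
    for dx, dy in product(range(n), repeat=2)
]


def responses_over_class(stimulus):
    """Set of A(2) responses observed over every translate of the stimulus."""
    return {
        a2_responds(full_source_set, shift(stimulus, dx, dy, n), theta)
        for dx, dy in product(range(n), repeat=2)
    }


matching = frozenset({(2, 2), (3, 2)})      # some translate fits the template
non_matching = frozenset({(0, 0), (2, 3)})  # no translate fits the template
print(responses_over_class(matching), responses_over_class(non_matching))
```

In both cases the set of responses contains a single value, so the unit
either responds to every member of the class or to none of them.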


The conditions considered above, where $m$ is equal to the
order of the group, and each $A^{(1)}$ unit responds to only one possible
stimulus, are impractical in the extreme, for a retina of reasonable
size. It should be clear from the above arguments, however, that even
with smaller values of $m$ (so long as $m > 1$) and lower thresholds, a
bias will exist for an $A^{(2)}$ unit to respond to similar stimuli, rather
than dissimilar stimuli, under the group $\mathcal{G}$. We now pass on to a
quantitative analysis of the performance of this system, first for an
environment of random "salt-and-pepper" stimuli, and then for an
environment of square stimuli.


The performance of a four-layer perceptron of the type under
consideration can be obtained from preceding analyses of elementary
perceptrons if we know the G-matrix or the Q-functions of the $A^{(2)}$
units. The expected performance of the system (or the actual performance of
a very large system) is entirely determined by the functions
$Q_{ij}^{(2)}$, i.e., the probability that a second-layer A-unit will
respond both to $S_i$ and to $S_j$. We will consider the case of a
perceptron with $N_s$ sensory points, and a universe of random
dot-stimuli, each consisting of $R N_s = n_s$ sensory points chosen at
random from a uniform distribution. Let $T$ be any transformation in
$\mathcal{G}$, such that the measure of the set of fixed points under the
transformation is zero. We will use the notation $S_i'$ to denote the
transform $T(S_i)$, and $S_i^*$ to denote some other transform
$T^*(S_i)$, ($T^* \ne T$). With this notation, $Q_{ii'}^{(2)}$ is the
probability that an $A^{(2)}$ unit responds to $S_i$ and to $T(S_i)$, and
$Q_{ii^*}^{(2)}$ is the probability that it responds to $S_i$ and to
$T^*(S_i)$.
First of all, we have

$$Q_{ii'}^{(2)} = Q_i^{(2)}\, Q_{i'|i}^{(2)} \tag{15.3}$$


where $Q_{i'|i}^{(2)}$ = conditional probability that an $A^{(2)}$ unit
responds to $S_i'$ given that it responds to $S_i$. For the first factor of
this expression, we have the close approximation

$$Q_i^{(2)} \approx 1 - \left(1 - Q_i^{(1)}\right)^m \tag{15.4}$$

This approximation assumes that the $m$ $A^{(1)}$ units connected to an
$A^{(2)}$ unit all have an independent chance of responding to stimulus
$S_i$. This will be approximately true if $\theta \ll n_s$ for the
$A^{(1)}$ units. In this case, since the stimuli consist of random point
configurations, the knowledge that an origin point of the first $A^{(1)}$
unit falls on an active S-point still leaves $n_s - 1$ possible S-points in
the same stimulus, any one of which might coincide with the transform of
the origin point for one of the other $A^{(1)}$ units. In the range of
parametric conditions with which we are generally concerned, equation
(15.4) approaches a perfect equality.
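The independence assumption behind equation (15.4) amounts to asking for
the probability that at least one of $m$ independent units responds. A
minimal sketch, assuming that reading of the equation (the function names
are hypothetical), compares the closed form with an explicit enumeration
of all response patterns:

```python
from itertools import product


def q2_single(q1, m):
    """Closed form: probability that at least one of m independent units fires."""
    return 1.0 - (1.0 - q1) ** m


def q2_by_enumeration(q1, m):
    """Same probability, summed over every pattern of m independent responses."""
    total = 0.0
    for outcome in product([0, 1], repeat=m):
        prob = 1.0
        for bit in outcome:
            prob *= q1 if bit else (1.0 - q1)
        if any(outcome):
            total += prob
    return total


for m in (1, 2, 5):
    print(m, q2_single(0.1, m), q2_by_enumeration(0.1, m))
```

For $m = 1$ the expression reduces to $Q_i^{(1)}$ itself, consistent with
the earlier remark that the system is then equivalent to an elementary
perceptron.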
For the second factor in (15.3) we have the approximation
(which is accurate for small $Q_i^{(1)}$)

$$Q_{i'|i}^{(2)} \approx \frac{m-1}{w-1} + \left(1 - \frac{m-1}{w-1}\right)\left[1 - \left(1 - Q_{i'|i^*}^{(1)}\right)^m\right] \tag{15.5}$$

where $w$ is the order of the group $\mathcal{G}$. The first term of this
expression, $\frac{m-1}{w-1}$, is the probability that one of the $m - 1$
$A^{(1)}$ units, other than the one which is known to have responded to
$S_i$, has an origin configuration which is a $T$-transform of the
configuration of the "known" A-unit. There are $m - 1$ non-identical
possibilities that this transform is present, and $w - 1$ transforms from
which they are chosen. If this condition is met, then the $A^{(2)}$ unit
must certainly respond to $T(S_i)$. If this condition is not met, with
probability $1 - \frac{m-1}{w-1}$, it is still possible that one of the
$A^{(1)}$ units responds to $T(S_i)$, and this probability is given by the
last term of the above expression. Here $Q_{i'|i^*}^{(1)}$ is the
probability that an $A^{(1)}$ unit, which is known to respond to some
transform $T^*(S_i)$, will also respond to $S_i'$. Since $T^*$ may be any
transformation (including the identity) so long as it is not equal to $T$,
all of the $m$ $A^{(1)}$ units are equally good candidates for such a
response. Specifically, for the case under consideration,


$$Q_{i'|i^*}^{(1)} = \sum_{n_c = 0}^{n_s} P(n_c)\, Q_{i'|i^*}^{(1)}(n_c) = E\, Q_{i'|i^*}^{(1)} \tag{15.6}$$
where $n_c$ = the number of common sensory points in $S_i'$ and $S_i^*$,
with probability

$$P(n_c) = \binom{n_s}{n_c}\, p^{n_c} (1 - p)^{n_s - n_c}, \qquad p = \frac{n_s - 1}{N_s - 1} \tag{15.7}$$

Note that the probability $p$ that a point in $S_i^*$ is in the common area
is based on $N_s - 1$ possible locations, since it cannot occupy the
location of its transform in $S_i'$; however, there are $n_s - 1$ other
points in $S_i'$ whose locations it might occupy. The only quantity which
we still lack is $Q_{i'|i^*}^{(1)}(n_c)$, which is given by

$$Q_{i'|i^*}^{(1)}(n_c) = \frac{Q_{i'i^*}^{(1)}(n_c)}{Q_{i^*}^{(1)}} = \frac{Q_{ij}(C)}{Q_i}$$

where $Q_{ij}(C)$ is computed from Equation (6.5) with $C = n_c / N_s$.
Substituting, we have

$$Q_{i'|i^*}^{(1)} = \frac{1}{Q_i} \sum_{n_c = 0}^{n_s} P(n_c)\, Q_{ij}\!\left(\frac{n_c}{N_s}\right)$$
Note that as $N_s$, the number of retinal points, goes to infinity (with
$n_s / N_s$ constant) this quantity approaches

$$\frac{Q_{ij}(R^2)}{Q_i}$$

which is equal to $Q_j$ for the binomial model. At the same time, the first
term of (15.5) goes to zero if $m$ remains finite and the order of the group
increases with the number of possible retinal locations of the stimulus.
Thus, for an infinite retina and a transformation group of infinite order,
we have

$$Q_{i'|i}^{(2)} = Q_{i'}^{(2)} \tag{15.9}$$

and

$$Q_{ii'}^{(2)} = \left[1 - \left(1 - Q_i^{(1)}\right)^m\right]^2 \tag{15.10}$$

which is identical to the expression for $Q_{ij}^{(2)}$ for a pair of
random, unrelated stimuli. Thus, with an infinite retina, no additional
generalization is to be expected from a random stimulus to its transform
under the conditions assumed above. For a finite retina, however, (or for a
finite group $\mathcal{G}$) we have the inequality

$$Q_{ii'}^{(2)} > Q_{ij}^{(2)}$$

due to the effect of the first term in equation (15.5).
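The effect of the first term of (15.5) can be seen numerically. The sketch
below assumes the form of (15.3), (15.4), and (15.5) described in the text:
a term $(m-1)/(w-1)$ for an exact transform match, plus the chance that
some unit responds anyway. It makes one further simplification that is not
in the text: the conditional probability for a single $A^{(1)}$ unit is
set equal to its unconditional value, which is the infinite-retina limit.
All function names and numerical values are illustrative.

```python
def q2_single(q1, m):
    """Probability that at least one of m independent A(1) units responds."""
    return 1.0 - (1.0 - q1) ** m


def q2_conditional(q1_cond, m, w):
    """P(A(2) responds to T(S_i) | it responds to S_i), for a group of order w."""
    exact_match = (m - 1) / (w - 1)  # some unit holds the exact T-transform
    return exact_match + (1.0 - exact_match) * q2_single(q1_cond, m)


def q2_joint(q1, q1_cond, m, w):
    """Joint probability of responding to S_i and to T(S_i)."""
    return q2_single(q1, m) * q2_conditional(q1_cond, m, w)


q1 = 0.05
m = 5
baseline = q2_single(q1, m) ** 2  # unrelated random stimuli
finite = q2_joint(q1, q1, m, w=36)
near_infinite = q2_joint(q1, q1, m, w=10**12)
print(baseline, finite, near_infinite)
```

With a group of order 36 the joint probability exceeds the baseline, while
for a very large group it becomes indistinguishable from it, which is the
qualitative behavior stated above.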
Let us now turn to a modification of the above problem, in
which the environment consists of square patterns with edges aligned
in a square (toroidal) retina, and the group $\mathcal{G}$ consists of all
possible translations. In particular, we will take the transformation $T$
to be a lateral translation by half the width of the retina. The notation
$S_i'$ will be used for $T(S_i)$, and $T^*$ will be taken to mean any
transformation in $\mathcal{G}$ not equal to $T$ and not equal to the
identity transformation. For convenience, we restrict the area of the
stimuli so that $R \le .25$. This guarantees that $S_i$ and $S_i'$ are
always disjoint patterns. $Q_i^{(1)}$ is again assumed to be small. In this
case we have, in place of (15.5)


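$$Q_{i'|i}^{(2)} \approx \frac{m-1}{w-1} + \left(1 - \frac{m-1}{w-1}\right)\left[1 - \left(1 - \frac{Q_{ij}^{(1)}(0)}{Q_i^{(1)}}\right) E \prod_{k=1}^{m-1}\left(1 - Q_{i'|i_k^*}^{(1)}\right)\right]$$

where the expectation is with respect to selections of transformations
such that $T_k^*(S_i) = S_{i_k}^*$.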
To avoid the computation of this expectation, we make the
further approximation that the expectation of the product of the above
sequence of Q-functions is equal to the product of the expected values of
the Q-functions. Now it can be shown that for any distribution of
$Q_{i'|i^*}$,

$$E \prod (1 - Q) \le \prod E(1 - Q) = \prod (1 - EQ)$$

It follows from this that the approximation which we now propose to make
will be a conservative one, yielding values of $Q_{ii'}^{(2)}$ which are
slightly smaller than they should be. With this approximation, we now have:
$$Q_{i'|i}^{(2)} \approx \frac{m-1}{w-1} + \left(1 - \frac{m-1}{w-1}\right)\left[1 - \left(1 - \frac{Q_{ij}^{(1)}(0)}{Q_i^{(1)}}\right)\left(1 - Q_{i'|i^*}^{(1)}\right)^{m-1}\right] \tag{15.11}$$

since the "known" $A^{(1)}$ unit which responds to $S_i$ has the
conditional probability

$$\frac{Q_{ij}^{(1)}(0)}{Q_i^{(1)}}$$

of responding to the disjoint transform $S_i'$. The expression for
$Q_{i'|i^*}^{(1)}$ is again given by (15.6), only the probability $P(n_c)$
is different from the random stimulus case. A general equation for
$P(n_c)$ will not be developed here, for a finite retina; in particular
cases, it is obtained by counting all of the possible ways in which a
square and its translate can intersect to yield $n_c$ common points. Some
numerical examples will be considered in the following section. Note that
the modification from Equation (15.5) to (15.11) will have the effect of
tending to diminish the value of $Q_{ii'}^{(2)}$ for small values of $m$,
so that for $m = 1$ the generalization to a disjoint square will always be
less than the generalization from a square to a random stimulus of the same
area, which is still given by

$$Q_{ij}^{(2)} = \left[1 - \left(1 - Q_i^{(1)}\right)^m\right]^2 \tag{15.12}$$
If we go to the limit of an infinite retina (and infinite
transformation group) with the environment of square stimuli just
considered, the results differ considerably from the random stimulus case.
The difference is due to the distribution of the common area, $C$, which,
in the case of the random stimuli, went to $R^2$ with probability 1. In the
case of randomly placed square stimuli, the probability of a zero
intersection in an infinite retina is given by

$$P(C = 0) = 1 - \frac{4h^2}{r^2} \tag{15.13}$$

where $h$ = length of edge of square,
$r$ = width of retina ($r \ge 2h$).

The probability of $0 < C \le q$ will be $4/r^2$ times the area under the
hyperbola $y = q/x$ from $y = 0$ to $h$. Specifically,

$$P(0 < C \le q) = \frac{4}{r^2}\left[q + \int_{q/h}^{h} \frac{q}{x}\, dx\right] = \frac{4q}{r^2}\left[1 + \ln\frac{h^2}{q}\right]$$

Differentiating,

$$P(C = q) = \frac{4}{r^2} \ln\left(\frac{h^2}{q}\right) \tag{15.14}$$
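The density of the common area can be checked for consistency with
(15.13): integrated over all positive overlaps it should account for the
probability $4h^2/r^2$ of a non-zero intersection. The sketch below does
this with a simple midpoint rule. It assumes the logarithmic form of the
density obtained by differentiation and a closed form for the cumulative
probability derived from the overlap $(h - |dx|)(h - |dy|)$ of two
displaced squares; the function names and the values $h = 1$, $r = 4$ are
illustrative.

```python
import math


def overlap_density(q, h, r):
    """Density of the common area C at C = q, for 0 < q <= h*h."""
    return 4.0 / r**2 * math.log(h * h / q)


def overlap_cdf(q, h, r):
    """P(0 < C <= q): 4/r^2 times the area where the overlap is at most q."""
    return 4.0 / r**2 * q * (1.0 + math.log(h * h / q))


def integrate_density(q_max, h, r, steps=200000):
    """Midpoint-rule integral of the density from 0 to q_max."""
    dq = q_max / steps
    return sum(overlap_density((k + 0.5) * dq, h, r) for k in range(steps)) * dq


h, r = 1.0, 4.0
p_nonzero = 4.0 * h * h / r**2
print(integrate_density(h * h, h, r), p_nonzero)
print(integrate_density(0.3, h, r), overlap_cdf(0.3, h, r))
```

The midpoint rule is used because the density has an integrable
logarithmic singularity at zero overlap, which an endpoint rule would
evaluate directly.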


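Thus, for a square stimulus of area $R$ in a retina of area 1
($R \le \frac{1}{4}$) we have

$$Q_{i'|i^*}^{(1)} = \frac{1}{Q_i^{(1)}}\left[\lim_{\epsilon \to 0} \int_{\epsilon}^{R} P(C)\, Q_{ij}^{(1)}(C)\, dC + (1 - 4R)\, Q_{ij}^{(1)}(0)\right]$$

$$= \frac{4}{Q_i^{(1)}} \lim_{\epsilon \to 0} \int_{\epsilon}^{R} \ln\left(\frac{R}{C}\right) Q_{ij}^{(1)}(C)\, dC + \frac{1 - 4R}{Q_i^{(1)}}\, Q_{ij}^{(1)}(0) \tag{15.15}$$

Substituting this in (15.11) yields an expression for $Q_{i'|i}^{(2)}$ for
the infinite retina, and $Q_{ii'}^{(2)}$ can be computed by (15.3), as
usual.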


15.3.3 Examples


Figure 41 illustrates the behavior of a similarity-constrained
perceptron, as a function of $m$, for various combinations of retinal
size and types of stimuli. The transformation group, in each case,
consists of all horizontal and vertical translations in a square,
toroidally connected retina. The stimuli considered are a pair of
independent random-dot stimuli, $S_a$ and $S_b$, a square stimulus $S_q$,
and the transforms $S_a'$, $S_q'$, where the transformation employed is a
shift of half the width of the retina. This guarantees that the square
stimulus $S_q$ is disjoint from its transform $S_q'$. All stimuli have an
area $R$ equal to one fourth of the retina. The parameters of the
$A^{(1)}$ units are $x = y = 4$, $\theta = 2$.
Figure 41 $Q_{ij}^{(2)}$ FOR 4-LAYER SIMILARITY-GENERALIZING PERCEPTRON.
$\theta = 2$, $R = .25$. TRANSFORMATION GROUP OF HORIZONTAL AND VERTICAL
TRANSLATIONS. $S_i'$ = $S_i$ DISPLACED BY HALF-WIDTH OF RETINA.


The bottom solid curve provides a baseline, with which the
other conditions can be compared. This curve is identical for $Q_{ab}$
(both stimuli random and independent), $Q_{aa'}$ (a random stimulus and its
transform) where $N_s$ is infinite, and $Q_{qa}$ (a square stimulus vs. a
random stimulus). In a small, finite retina however (specifically, with
$N_s = 36$) a random stimulus will generalize more strongly to its
transform than to an independent random stimulus, for any $m > 1$.
This is shown by the upper of the two solid curves. The broken curves
illustrate the generalization from a square to its (disjoint) transform,
both for the 6 by 6 retina, and for the infinite retina. In both cases, we
find that the system generalizes more strongly to a random stimulus if $m$
is small, but that as $m$ is increased, the perceptron begins to generalize
more strongly to the disjoint transform than to a random, unrelated
stimulus. For the infinite retina, the cross-over occurs between $m = 4$
and $m = 5$. This means that for a $\gamma$-system, with $m \ge 5$,
$g_{ij}$ will be positive from a square to any other square, and will be
zero from a square to a random dot stimulus. Increasing the threshold of
the $A^{(1)}$ units will reduce $Q_{ij}^{(2)}$ for all curves, but will
increase the relative bias towards similar stimuli, and will shift the
cross-over point further to the left for the $Q_{qq'}$ curves.
The difference in performance for squares as opposed to
random stimuli will tend to be characteristic of any coherent stimulus
patterns, provided the transformation group is one which preserves the
coherence, or compactness, of the stimuli. This may be puzzling to
some readers who recognize that under the connection rules employed
in these perceptrons, there is nothing unique about topologically
connected, or continuous regions, which would affect the perceptron's
ability to recognize them in any different way than disconnected regions.
It is, after all, only the set of points to which connections happen to be
made which determines the response of a perceptron, and if every S-unit
were randomly interchanged with some other S-unit, a corresponding change
being induced in the stimulus environment, the performance of the
perceptron should not be affected at all. This will indeed be true,
provided any transformation group employed in the first perceptron is
replaced by a new transformation group corresponding to the rearranged
retina. The essential feature of coherent stimuli with a group of
coherence-preserving transformations is that the probability distribution
of stimulus-intersections does not concentrate at the expected value of the
intersection, as $N_s$ and the order of the group become infinite. This
permits a similarity bias to be maintained for such stimuli which cannot be
maintained for random stimuli. Any group generated by a permutation
operation on the points of the retina will have the same property, provided
the same permutation operation is applied to the stimuli. Another way
of looking at the problem is to note that with random stimuli, a sensory
origin point which is close to a stimulus point, but does not coincide with
it exactly, has a probability of being activated no greater than that of
any other origin-point. With coherent stimuli, on the other hand, an
origin-point which is close to a stimulus point has a greater probability
of being activated than one which is remote from the stimulus point. Thus,
for random stimuli, only a transformed origin configuration which
corresponds exactly to the transformation $T$ will help in generalizing
from $S$ to $T(S)$. For coherent stimuli, it is sufficient that the
transformed origin points should be in the neighborhood of the required
transform; proximity to the required transformation is sufficient to
increase the probability of being activated by $T(S)$.*


Note that as $m$ increases, the value of $Q_{ij}^{(2)}$ tends to approach
unity for all curves in Fig. 41. This means that there will be a maximum
similarity bias at some finite value of $m$, beyond which the advantage of
similar over random stimuli will approach zero. By increasing the value
of $\theta$ for the $A^{(1)}$ units, the location of the maximum bias can
be shifted further to the right, until, with $\theta = x = n_s$, the
maximum will occur at $m = w$.


15.4 Laws of Similarity-Generalization in Perceptrons 


The results obtained in the previous section illustrate a
number of effects which are found quite generally in perceptrons which
show a capability for similarity-generalization, regardless of whether this
capability is learned or intrinsic, and regardless of whether the perceptron
is series-coupled or cross-coupled. Additional evidence for these general
results will be found in subsequent chapters, and they appear to take on
the status of empirical laws, which have now been substantiated for a
rather wide variety of systems. These laws can be tentatively stated as
follows:


* The effects noted here are directly analogous to those originally 
predicted for cross-coupled systems in Ref. 85. 
(1) As the size of the retina increases, it becomes increasingly
difficult to recognize the similarity of two random-pattern stimuli under a 
given transformation group, with a finite perceptron. With an infinite 
retina (and transformation group of infinite order) the similarity bias for 


random stimuli goes to zero. 


(2) The similarity-bias for coherent stimuli, under a
coherence-preserving transformation group, will generally be stronger
than for random stimuli, and will not go to zero even for an infinite retina 


and transformation group of infinite order. 


(3) The similarity bias of a perceptron can be increased 
by raising the threshold of its A-units or by increasing the number of 
connections to terminal A-units (i.e., generalization will be limited 
increasingly to the members of a similarity class, as the threshold or 


number of pre-terminal units is increased). 


(4) Generalization to disjoint transforms of a stimulus 
may be less than generalization to independent random patterns, for a 
perceptron with weak similarity bias; generalization to disjoint transforms 
can be made to exceed generalization to random stimuli, however, by an 
increase in A-unit thresholds or by increasing the number of inputs to 


the terminal A-units of the network. 
16. FOUR-LAYER PERCEPTRONS WITH ADAPTIVE PRETERMINAL
NETWORKS


The physical universe, at a macroscopic level, is characterized 
by the continuity of its transformations through time. Objects do not 
suddenly appear out of nowhere, persist for an instant, and then vanish 
into nothingness. Given an appropriate time-scale, all changes appear to 
occur smoothly and progressively. Consequently, stimuli which are highly 
similar under a continuous transformation group are more likely to occur in 
close temporal succession than dissimilar stimuli. In this chapter, it will 
be shown that an initially unbiased perceptron can take advantage of this 
property of the physical environment to evolve a capability for similarity 
generalization, without any intervention by an experimenter or reinforcement 


control system. 


The model which is presented here was developed jointly by 
Block, Knight, and Rosenblatt, in the hopes that its analysis would assist 
in the understanding of closely related problems which occur in cross-
coupled systems. The similarity between the performance of this system
and the performance of cross-coupled systems is most striking, as will 
be seen in later chapters. The main effects of cross-coupling will be to 
accelerate the adaptation process, and to make the system inherently 
responsive to stimulus sequences, rather than momentary stimuli. The 
presentation in the first parts of this chapter is essentially the same as 


that of Block, Knight, and Rosenblatt (Ref. 7). 
16.1 Description of the Model 

The perceptron to be analyzed is illustrated in Fig. 42. It is 
a four-layer series-coupled system, with an equal number ($N_a$) of $A^{(1)}$ 
units and $A^{(2)}$ units.* Each $A^{(2)}$ unit receives a variable-valued 
connection from each of the $A^{(1)}$ units. In addition, each $A^{(2)}$ unit 
receives a fixed-value connection from one of the $A^{(1)}$ units. For convenience, 
the $A^{(1)}$ and $A^{(2)}$ units are placed in one-to-one correspondence, 
with the fixed connection to each $A^{(2)}$ unit originating from its "mate" in 
the $A^{(1)}$ layer. The threshold of the $A^{(1)}$ units is $\theta^{(1)}$, and the 
threshold of the $A^{(2)}$ units is $\theta^{(2)}$. To simplify notation, we will use 
the symbol $\theta$ to designate $\theta^{(2)}$, unless otherwise indicated. The fixed 
connections from $A^{(1)}$ to $A^{(2)}$ units all have values $\ge \theta$. For 
specificity, we assume that all of these fixed values are exactly equal to $\theta$. 
The variable-valued connection from an $A^{(1)}$ unit $a_r^{(1)}$ to an $A^{(2)}$ unit $a_\mu^{(2)}$ 
has a value $u_{r\mu}(t)$ at time $t$. The symbol $u_{r\mu}$ will be used to designate 
values of $A^{(1)}$ to $A^{(2)}$ connections, and $v_\mu$ to designate values of $A^{(2)}$ 
to R-unit connections. The input connections to the $A^{(1)}$ units may be 
organized according to any of the models (e.g., binomial or Poisson) which 
were discussed in Part II. Signal transmission times, $\tau_{ij}$, are assumed 
to be equal to zero, for all connections. It is assumed that stimuli occur at 
times $t$, $t + \Delta t$, $t + 2\Delta t$, etc. 


* The numbers of units need not be equal for systems of this type to 
work; the constraint is introduced in order to simplify the analysis. It 
is equally satisfactory to organize the perceptron with $m$ variable-valued 
connections and 1 fixed-value connection to each $A^{(2)}$ unit, 
with origins chosen at random. 
The variable values $u_{r\mu}$ are assumed to be initially equal to zero, 
and change with time as follows: If unit $a_r^{(1)}$ is active at time $t$ and $a_\mu^{(2)}$ 
is active at $t + \Delta t$, then $u_{r\mu}$ receives an increment $(\eta \cdot \Delta t)$, and all 
connections $u_{r\mu}$ decay by a quantity $(\delta \cdot \Delta t)\, u_{r\mu}$. The values of the $A^{(2)}$ 
to R-unit connections may be varied by any one of the usual reinforcement 
rules. Note that under these rules, the values $u_{r\mu}$ will always be non-negative, 
so that if the "mate" of a given $a_\mu^{(2)}$ unit is active, the $a_\mu^{(2)}$ 
unit will always be active. In the subsequent analysis, it will be shown 
that with a suitable sequential organization of the environment, these 
dynamic rules can lead to the development of a perceptron organization 
closely analogous to that of the similarity-constrained perceptrons of the 
previous chapter. 
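The growth-and-decay rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `update_connections`, the list-of-lists layout `u[r][mu]`, and the parameter values are hypothetical and are not taken from the text. Non-negativity of the values is preserved as long as `delta * dt` does not exceed 1.

```python
def update_connections(u, a1_prev, a2_now, eta, delta, dt):
    """Apply one time step of the rule to every connection u[r][mu].

    u        : u[r][mu] is the value of the connection from A(1) unit r
               to A(2) unit mu (hypothetical data layout)
    a1_prev  : 0/1 activity of the A(1) units at time t
    a2_now   : 0/1 activity of the A(2) units at time t + dt
    eta, delta : growth and decay rates
    """
    new_u = []
    for r, row in enumerate(u):
        new_row = []
        for mu, value in enumerate(row):
            # Increment only if a_r(1) was active at t and a_mu(2) at t + dt.
            gain = eta * dt * a1_prev[r] * a2_now[mu]
            # Every connection decays in proportion to its current value.
            decay = delta * dt * value
            new_row.append(value + gain - decay)
        new_u.append(new_row)
    return new_u


# Two units per layer, all connections initially zero.
u0 = [[0.0, 0.0], [0.0, 0.0]]
# A(1) unit 0 active at t, A(2) unit 1 active at t + dt: only u[0][1] grows.
u1 = update_connections(u0, [1, 0], [0, 1], eta=0.5, delta=0.1, dt=1.0)
# No activity on the next step: the same connection simply decays.
u2 = update_connections(u1, [0, 0], [0, 0], eta=0.5, delta=0.1, dt=1.0)
print(u1, u2)
```

The two calls show the two halves of the rule separately: a single increment on the one connection whose endpoints were active in succession, followed by pure decay.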


16.2 General Analysis 

16.2.1 Development of the Steady-State Equation 

As in the last chapter, our main concern will be to find the 
values of $Q_{ij}^{(2)}$, which will permit further analysis to proceed along the 
lines employed for elementary perceptrons. Unlike the perceptrons of 
Chapter 15, however, the values of $Q_{ij}^{(2)}$, and consequently the G-matrix 
of the perceptron, are stochastic variables, depending upon the prior 
history of the system. 


The set of A-units in the $A^{(1)}$ layer responding to $S_i$ will 
be denoted by $A^{(1)}(S_i)$; the set responding to both $S_i$ and $S_j$ is 
$A^{(1)}(S_i) \cap A^{(1)}(S_j)$. For a perceptron with a known connection 
scheme for the $A^{(1)}$ layer (or for a sufficiently large perceptron) the 
fraction of $A^{(1)}$ units responding to both $S_i$ and $S_j$ will be $Q_{ij}^{(1)}$, 
and is equal to the number of elements in $A^{(1)}(S_i) \cap A^{(1)}(S_j)$ divided 
by $N_a$. These quantities are fixed for all time. 
Let $\alpha_\mu^{(i)}(t)$ denote the total input signal to the unit $a_\mu^{(2)}$ 
at time $t$, in response to stimulus $S_i$.* Then 

$$\alpha_\mu^{(i)}(t) = \theta\, a_\mu^*(S_i) + \sum_{r=1}^{N_a} u_{r\mu}(t)\, a_r^*(S_i) \tag{16.1}$$

where 

$$a_r^*(S_i) = \begin{cases} 1 & \text{if } S_i \text{ activates } a_r^{(1)} \\ 0 & \text{otherwise} \end{cases}$$

This represents the sum of the signal arriving at $a_\mu^{(2)}$ on its fixed 
connection, and all of the signals arriving on the variable-valued connections 
at time $t$. Let 

$$\beta_\mu^{(i)} = \theta\, a_\mu^*(S_i) \tag{16.2}$$

$$\gamma_\mu^{(i)}(t) = \sum_{r=1}^{N_a} u_{r\mu}(t)\, a_r^*(S_i) \tag{16.3}$$

Then 

$$\alpha_\mu^{(i)}(t) = \beta_\mu^{(i)} + \gamma_\mu^{(i)}(t) \tag{16.4}$$

* The indices $i$, $j$, and $k$ will be used throughout this chapter to 
designate various stimuli, and the indices $r$ and $\mu$ will be used to 
designate particular A-units. 
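Note that $\beta_\mu^{(i)}$ is $\theta$ or 0 depending on whether $a_\mu^{(1)}$ is in $A^{(1)}(S_i)$ 
or not; it is invariant with time. On the other hand, $\gamma_\mu^{(i)}(t)$ represents 
the effect of the variable $A^{(1)}$ to $A^{(2)}$ connections. 

Now suppose that at time $t_0$ stimulus $S_j$ occurs, and at 
time $t_0 + \Delta t$ stimulus $S_k$ occurs. Then the consequent change in $u_{r\mu}$ 
will be 

$$u_{r\mu}(t_0 + 2\Delta t) - u_{r\mu}(t_0 + \Delta t) = (\eta \cdot \Delta t)\, a_r^*(S_j)\, \phi\big(\alpha_\mu^{(k)}(t_0 + \Delta t)\big) - (\delta \cdot \Delta t)\, u_{r\mu}(t_0 + \Delta t) \tag{16.5}$$

where 

$$\phi(x) = \begin{cases} 0 & \text{for } x < \theta \\ 1 & \text{for } x \ge \theta \end{cases}$$

From (16.3) and (16.5) we get 

$$\gamma_\mu^{(i)}(t_0 + 2\Delta t) - \gamma_\mu^{(i)}(t_0 + \Delta t) = \sum_{r=1}^{N_a} \big[u_{r\mu}(t_0 + 2\Delta t) - u_{r\mu}(t_0 + \Delta t)\big]\, a_r^*(S_i)$$

$$= (\eta \cdot \Delta t)\, \phi\big[\alpha_\mu^{(k)}(t_0 + \Delta t)\big] \sum_{r=1}^{N_a} a_r^*(S_j)\, a_r^*(S_i) - (\delta \cdot \Delta t) \sum_{r=1}^{N_a} u_{r\mu}(t_0 + \Delta t)\, a_r^*(S_i)$$

Hence 

$$\gamma^{(i)}(t_0 + 2\Delta t) - \gamma^{(i)}(t_0 + \Delta t) = (\eta \cdot \Delta t)\, \phi\big[\alpha^{(k)}(t_0 + \Delta t)\big]\, N_a Q_{ij}^{(1)} - (\delta \cdot \Delta t)\, \gamma^{(i)}(t_0 + \Delta t) \tag{16.6}$$

where, for brevity, the subscript $\mu$ has been suppressed. It must be 
remembered that $\gamma$ and $\alpha$, in these equations, refer to any particular 
$A^{(2)}$ unit, $a_\mu^{(2)}$. 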
Now suppose the sequence of stimuli $\{S_{j_0}, S_{j_1}, \ldots, S_{j_M}\}$ 
occurs at the successive times $t, t + \Delta t, \ldots, t + M\Delta t$. In Equation (16.6) 
we take $t_0 = t + m\Delta t$, $[m = 0, 1, 2, \ldots, (M-1)]$, $j = j_m$, $k = j_{m+1}$, 
and obtain 

$$\gamma^{(i)}(t + (m+2)\Delta t) - \gamma^{(i)}(t + (m+1)\Delta t) = (\eta \cdot \Delta t)\, \phi\big[\alpha^{(j_{m+1})}(t + (m+1)\Delta t)\big]\, N_a Q_{i j_m}^{(1)} - (\delta \cdot \Delta t)\, \gamma^{(i)}(t + (m+1)\Delta t) \tag{16.7}$$

Summing on $m$ from 0 to $M-1$ we get the change in $\gamma^{(i)}$ due to the 
entire sequence of stimuli: 

$$\gamma^{(i)}(t + (M+1)\Delta t) - \gamma^{(i)}(t + \Delta t) = \sum_{m=0}^{M-1} \Big\{ (\eta \cdot \Delta t)\, \phi\big[\alpha^{(j_{m+1})}(t + (m+1)\Delta t)\big]\, N_a Q_{i j_m}^{(1)} - (\delta \cdot \Delta t)\, \gamma^{(i)}(t + (m+1)\Delta t) \Big\} \tag{16.8}$$

We now divide by $M \Delta t$ and let $\Delta t$ approach zero* to obtain 

$$\frac{d\gamma^{(i)}}{dt} = \frac{\eta N_a}{M} \sum_{m=0}^{M-1} \phi\big(\alpha^{(j_{m+1})}(t)\big)\, Q_{i j_m}^{(1)} - \delta\, \gamma^{(i)}(t) \tag{16.9}$$

* An alternative treatment is possible in which difference equations are 
carried throughout, rather than converting to a differential equation. 
The true solution for $\gamma^{(i)}$ obtained from such an approach is a 
fluctuating function, the local time-average of which corresponds to the 
solution of the differential equation, which is obtained here. As long as 
$\eta$ and $\delta$ are sufficiently small, the differential equation, which is 
somewhat easier to manage, yields a close approximation to the true 
solution of the finite difference equation. 
Let $F_{jk}$ be the number of times the stimulus pair $S_j S_k$ occurs 
in the given sequence $S_{j_0}, S_{j_1}, \ldots, S_{j_M}$; also, let $f_{jk} = F_{jk}/M$ be the 
average frequency of the pair $S_j S_k$. Then from (16.9) we get 

$$\frac{d\gamma^{(i)}}{dt} = \eta N_a \sum_{j=1}^{n} \sum_{k=1}^{n} f_{jk}\, \phi\big(\alpha^{(k)}(t)\big)\, Q_{ij}^{(1)} - \delta\, \gamma^{(i)}(t) \tag{16.10}$$

where $n$, as usual, represents the number of distinct stimulus patterns in 
the environment. Defining the matrix $C = Q^{(1)} f$, with elements 

$$C_{ij} = \sum_{k=1}^{n} Q_{ik}^{(1)} f_{kj}$$

we have from (16.10), after relabeling the summation indices, 

$$\frac{d\gamma^{(i)}}{dt} = \eta N_a \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \gamma^{(j)}\big) - \delta\, \gamma^{(i)}(t) \tag{16.11}$$

This gives us a non-linear system of differential equations for $\gamma^{(1)}(t), \ldots, \gamma^{(n)}(t)$ 
with initial conditions $\gamma^{(i)}(0) = 0$. 

If the frequencies $f_{kj}$ vary with $t$, then the coefficients 
$C_{ij}$ are time-dependent, but in any case they are non-negative and 
bounded; $\phi$ is non-negative, monotone increasing in $\gamma$, bounded and 
continuous on the right. It will be assumed here that the $C_{ij}$ are 
constants (corresponding to fixed frequencies, $f_{kj}$). 
In preparation for discussing the solution of (16.11), consider 
the equilibrium equation 

$$\gamma^{(i)} = \frac{\eta N_a}{\delta} \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \gamma^{(j)}\big) \tag{16.12}$$

This corresponds to a solution of (16.11) for the steady-state condition in 
which the rate of gain (represented by the first term of 16.11) is exactly 
counterbalanced by the rate of decay. But the system of equations (16.12) 
may have more than one solution. However, we shall show that there is a 
unique minimal solution (by which we mean a solution none of whose components 
$\gamma^{(i)}$ exceed the corresponding components of another solution); and 
this minimal solution is obtained in a finite number (at most $n$) of iterations 
of (16.12), starting with all $\gamma^{(j)} = 0$ on the right-hand side of the 
equation, finding the new values of $\gamma^{(i)}$ from (16.12), putting these back 
into the right-hand side, and so on. That is, we take $\gamma_0^{(i)} = 0$ and 

$$\gamma_{m+1}^{(i)} = \frac{\eta N_a}{\delta} \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \gamma_m^{(j)}\big) \tag{16.13}$$

We shall prove first that this process terminates in at most $n$ 
iterations. This can be seen from the following considerations. Since 
the right-hand side of (16.13) is non-negative and $\gamma_0^{(i)} = 0$, it follows 
that $\gamma_1^{(i)} \ge \gamma_0^{(i)}$. Now since the right side of (16.13) is a non-decreasing 
function of the $\gamma$'s, it follows that $\gamma_2^{(i)} \ge \gamma_1^{(i)}$, and in general 
$\gamma_{m+1}^{(i)} \ge \gamma_m^{(i)}$. Therefore, also $\phi(\beta^{(j)} + \gamma_{m+1}^{(j)}) \ge \phi(\beta^{(j)} + \gamma_m^{(j)})$; 
that is, successive $\phi$'s cannot decrease. If, at a particular step, no $\phi$ 
increases, then we are at a solution. The $\phi$'s have only the values zero 
or 1, so even if only a single $\phi$ changes at each step, the process terminates 
in at most $n$ steps. 
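The iteration (16.13), and the remark that (16.12) may have more than one solution, can be illustrated with a small Python sketch. The names `phi` and `iterate_equilibrium`, the combined coefficient matrix `W` (standing for $(\eta N_a/\delta)\,C_{ij}$), and the two-stimulus numbers are all hypothetical choices made for this illustration; they do not come from the text.

```python
def phi(x, theta):
    """Threshold function of (16.5): 1 when x >= theta, otherwise 0."""
    return 1 if x >= theta else 0


def iterate_equilibrium(W, beta, theta, gamma0=None, max_steps=100):
    """Iterate (16.13) until the gamma vector stops changing.

    W[i][j] stands for (eta * N_a / delta) * C_ij (assumed precomputed).
    Returns the fixed point and the number of iterations that changed gamma.
    """
    n = len(W)
    gamma = list(gamma0) if gamma0 is not None else [0.0] * n
    for step in range(max_steps):
        active = [phi(beta[j] + gamma[j], theta) for j in range(n)]
        new_gamma = [sum(W[i][j] * active[j] for j in range(n)) for i in range(n)]
        if new_gamma == gamma:
            return gamma, step
        gamma = new_gamma
    raise RuntimeError("no fixed point reached within max_steps")


# Hypothetical two-stimulus system, theta = 1.
W = [[0.6, 0.6], [0.6, 0.6]]

# A unit outside both sets (beta = 0): starting from zero gives the minimal solution.
minimal, _ = iterate_equilibrium(W, beta=[0.0, 0.0], theta=1.0)
# Starting above threshold lands on a second, larger solution of (16.12).
larger, _ = iterate_equilibrium(W, beta=[0.0, 0.0], theta=1.0, gamma0=[2.0, 2.0])
# A unit in the first set only (beta = (theta, 0)).
captured, steps = iterate_equilibrium(W, beta=[1.0, 0.0], theta=1.0)
print(minimal, larger, captured, steps)
```

Starting from zero is what singles out the minimal solution; the second call shows that a different starting point can settle on another solution of the same equilibrium equation.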
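The solution thus obtained will be denoted by $\gamma^{(i)*}$. We 
shall now prove that this solution is minimal. Let $\bar\gamma^{(i)}$ be any solution of 
the equilibrium equation (16.12). Then for the iteration process (16.13), 
we have $\gamma_0^{(i)} \le \bar\gamma^{(i)}$, for all $i$. Since the right-hand side of (16.13) 
is a monotone function of $\gamma$, we have 

$$\gamma_1^{(i)} = \frac{\eta N_a}{\delta} \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \gamma_0^{(j)}\big) \le \frac{\eta N_a}{\delta} \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \bar\gamma^{(j)}\big) = \bar\gamma^{(i)}$$

Similarly, $\gamma_2^{(i)} \le \bar\gamma^{(i)}$, and so on, whence $\gamma^{(i)*} \le \bar\gamma^{(i)}$; hence $\gamma^{(i)*}$ is 
minimal. 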


To avoid consideration of a special pathological case, we now 
make a mild assumption. Consider the sum $\frac{\eta N_a}{\delta} \sum_{j \in R} C_{ij}$ taken over a 
subset $R$ of the possible values of $(1, 2, \ldots, n)$. We assume that no 
such sum is equal to $\theta$. This is not a serious assumption, since by a 
small change in $\eta/\delta$ this requirement can always be satisfied. 

Now suppose that the $\gamma^{(i)}(t)$ satisfy the system of 
differential equations (16.11) and the initial conditions $\gamma^{(i)}(0) = 0$. 
Then we assert that the $\gamma^{(i)}(t)$ are non-decreasing and 
$\lim_{t \to \infty} \gamma^{(i)}(t) = \gamma^{(i)*}$. That is, the solution obtained by the iterative 
process (16.13) is indeed the limit approached by the solution of the 
differential equation (16.11), with initial conditions zero in each case. 
First we shall show that 

$$\frac{d\gamma^{(i)}}{dt} \ge 0. \quad \text{Moreover, if } \gamma^{(i)}(t) > 0, \text{ then } \frac{d\gamma^{(i)}}{dt} > 0. \tag{$\alpha$}$$

As a preliminary step, consider the nature of the solution of 
the equation $\frac{d\gamma}{dt} = M - \delta\gamma$, where $M$ and $\delta$ are positive constants, 
and $\gamma(0) = a$, where $0 \le a < M/\delta$. The solution, 
$\gamma = \frac{M}{\delta} - \left(\frac{M}{\delta} - a\right) e^{-\delta t}$, 
has the appearance of the following curve: 

[Figure (a): solution curve rising from $a$ toward the asymptote $M/\delta$] 

The solution approaches $M/\delta$ monotonely from below, and $d\gamma/dt > 0$ 
for all $t > 0$. If at time $t = t_1$ we replace $M$ by $M_1 > M$, the 
solution appears as 

[Figure (b): solution curve approaching $M/\delta$ up to $t_1$, then approaching $M_1/\delta$] 

As $t$ goes from 0 to $t_1$, the solution approaches $M/\delta$ monotonely from 
below; as $t$ increases beyond $t_1$ the solution approaches $M_1/\delta$ monotonely 
from below. The solution is continuous; so is its derivative, except at $t_1$, 
where the left and right hand derivatives are not equal, but both are positive. 

If instead of $M > a = 0$, we take $M = a = 0$, the solution 
is $\gamma(t) = 0$ for $0 \le t \le t_1$. 

We now proceed to the proof of ($\alpha$). Let 
$M^{(i)}(t) = \eta N_a \sum_{j=1}^{n} C_{ij}\, \phi\big(\beta^{(j)} + \gamma^{(j)}(t)\big)$. Then (16.11) can be written 

$$\frac{d\gamma^{(i)}}{dt} = M^{(i)}(t) - \delta\, \gamma^{(i)}(t) \tag{16.14}$$

where here and in the following paragraph, $i$ is a generic index of the set 
$(1, 2, \ldots, n)$, while $j$ and $k$ will refer to specific indices to be 
defined below. 

Each function $M^{(i)}(t)$ can take on at most $2^n$ possible 
values. Let $\mu$ be a specific value of $i$ and suppose first that $M^{(\mu)}(0) > 0$. 
The only times at which $M^{(i)}(t)$ can change its value are when one of the 
$\gamma^{(j)}$ (indeed one whose corresponding $\beta^{(j)} = 0$) reaches the value $\theta$. 
Suppose the first time at which this happens is $t_1 > 0$. Suppose then 
that $\gamma^{(j)}(t_1) = \theta$. Since in the interval $0 \le t < t_1$ all $d\gamma^{(i)}/dt \ge 0$, 
we have $M^{(i)}(t_1) \ge M^{(i)}(0)$. Thus the solution $\gamma^{(i)}(t)$ appears 
as in Figure (b) above; in particular, for all $i$ such that $M^{(i)}(0) > 0$ 
we have $\gamma^{(i)}(t_1) < M^{(i)}(0)/\delta \le M^{(i)}(t_1)/\delta$; and for the others 
$\gamma^{(i)}(t_1) \le M^{(i)}(t_1)/\delta$. Furthermore, since both the left and right 
derivatives of $\gamma^{(j)}(t_1)$ are positive we have, for $t > t_1$ and sufficiently 
close to $t_1$, $\gamma^{(j)}(t) > \theta$, so that it will not be until $t_2$, with $t_2 > t_1$, 
that there will again be a $\gamma^{(k)}(t)$ having the value $\theta$. In the interval 
$t_1 \le t < t_2$ we have the same conditions as we had in the interval 
$0 \le t < t_1$, namely $\frac{d\gamma^{(i)}}{dt} = M^{(i)}(t_1) - \delta\, \gamma^{(i)}(t)$, with initial values 
$\gamma^{(i)}(t_1) \le M^{(i)}(t_1)/\delta$, and in particular $\gamma^{(\mu)}(t_1) < M^{(\mu)}(t_1)/\delta$. Thus in 
the interval $t_1 \le t < t_2$ we again have 
$\frac{d\gamma^{(i)}}{dt} \ge 0$, and $\frac{d\gamma^{(\mu)}}{dt} > 0$. 
The same argument applies to successive intervals $(t_2, t_3)$, $(t_3, t_4)$, 
and so on. Since the $\gamma^{(i)}(t)$ are monotone there are at most $n$ such 
intervals. 

If $M^{(\mu)}(0) = 0$, then $\gamma^{(\mu)}(t) = 0$ for $0 \le t < t_1$. If 
$M^{(\mu)}(t_1) > 0$, then we use the previous argument starting at $t = t_1$; 
otherwise $\gamma^{(\mu)}(t)$ remains zero at least until $t_2$, and so on. In any 
case, the statement ($\alpha$) has been proven. 


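Next we shall show that 

$$\lim_{t \to \infty} \gamma^{(i)}(t) = \gamma^{(i)*} \tag{$\beta$}$$

Since, from the proof of ($\alpha$) it is clear that each $\gamma^{(i)}(t)$ 
is monotone and bounded, $\lim_{t \to \infty} \gamma^{(i)}(t)$ exists; call it $\gamma_\infty^{(i)}$; 
it is a sum of the form $\frac{\eta N_a}{\delta} \sum_{j \in R} C_{ij}$, which was assumed at the 
outset to be unequal to $\theta$, and hence $\gamma_\infty^{(i)} \ne \theta$. Therefore, 
$\phi(\beta^{(j)} + \gamma^{(j)})$ is continuous when $\gamma^{(j)} = \gamma_\infty^{(j)}$. Letting $t \to \infty$ 
in equation (16.11) we see that $\gamma_\infty^{(i)}$ is a solution of the equilibrium 
equation (16.12). Hence $\gamma_\infty^{(i)} \ge \gamma^{(i)*}$, since $\gamma^{(i)*}$ is minimal. We 
next show that for all $t \ge 0$, $\gamma^{(i)}(t) \le \gamma^{(i)*}$. 

Note that initially $\gamma^{(i)}(0) \le \gamma^{(i)*}$. Suppose that $t_1$ 
is the first time at which some $\gamma^{(\mu)}(t_1) = \gamma^{(\mu)*}$. From (16.11) and 
the fact that $\phi$ is non-decreasing we see that at $t_1$, 

$$\frac{d\gamma^{(\mu)}}{dt} = \eta N_a \sum_{j} C_{\mu j}\, \phi\big(\beta^{(j)} + \gamma^{(j)}(t_1)\big) - \delta\, \gamma^{(\mu)}(t_1) \le \eta N_a \sum_{j} C_{\mu j}\, \phi\big(\beta^{(j)} + \gamma^{(j)*}\big) - \delta\, \gamma^{(\mu)*} = \delta\, \gamma^{(\mu)*} - \delta\, \gamma^{(\mu)*} = 0$$

i.e., $\frac{d\gamma^{(\mu)}}{dt} \le 0$ at $t = t_1$. 

If $\gamma^{(\mu)*} > 0$, we have from ($\alpha$) that $\frac{d\gamma^{(\mu)}}{dt} > 0$ at $t_1$, which is a 
contradiction. Suppose that $\gamma^{(\mu)*} = 0$, so that also $t_1 = 0$. Then, 
as long as no $\gamma^{(i)}(t)$ reaches a non-zero $\gamma^{(i)*}$, we have 

$$M^{(\mu)}(t) = \eta N_a \sum_{j=1}^{n} C_{\mu j}\, \phi\big(\beta^{(j)} + \gamma^{(j)}(t)\big) \le \eta N_a \sum_{j=1}^{n} C_{\mu j}\, \phi\big(\beta^{(j)} + \gamma^{(j)*}\big) = \delta\, \gamma^{(\mu)*} = 0$$

Hence over this period $\gamma^{(\mu)}(t) = 0$. But no non-zero $\gamma^{(i)*}$ can 
ever be attained by $\gamma^{(i)}(t)$, since, by the above argument, we 
would have $\frac{d\gamma^{(i)}}{dt} \le 0$ at the first time this occurs, in contradiction 
to ($\alpha$). 

Hence if $\gamma^{(i)*} > 0$, then $\gamma^{(i)}(t) < \gamma^{(i)*}$; and if 
$\gamma^{(i)*} = 0$, then $\gamma^{(i)}(t) = \gamma^{(i)*}$. In general, $\gamma^{(i)}(t) \le \gamma^{(i)*}$. 
Hence $\gamma_\infty^{(i)} = \lim_{t \to \infty} \gamma^{(i)}(t) \le \gamma^{(i)*}$, and ($\beta$) follows. 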
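The two assertions just proved can be checked numerically on a small case. The sketch below integrates (16.11) from zero initial conditions with a forward Euler step and compares the result with the minimal solution found by the iteration (16.13). The coefficient matrix `K` (standing for $\eta N_a C_{ij}$), the parameter values, the step size, and all function names are hypothetical choices made for this illustration; forward Euler is simply a convenient integrator, not a method taken from the text.

```python
def phi(x, theta):
    """Threshold function: 1 when x >= theta, otherwise 0."""
    return 1.0 if x >= theta else 0.0


def minimal_solution(K, beta, theta, delta):
    """Iterate (16.13) from gamma = 0; K[i][j] stands for eta * N_a * C_ij."""
    n = len(K)
    gamma = [0.0] * n
    while True:
        act = [phi(beta[j] + gamma[j], theta) for j in range(n)]
        new = [sum(K[i][j] * act[j] for j in range(n)) / delta for i in range(n)]
        if new == gamma:
            return gamma
        gamma = new


def integrate(K, beta, theta, delta, dt=1e-3, t_end=20.0):
    """Forward-Euler integration of (16.11) from gamma(0) = 0.

    Returns the final gamma and a flag telling whether every component
    was non-decreasing along the way (up to rounding).
    """
    n = len(K)
    gamma = [0.0] * n
    monotone = True
    for _ in range(int(round(t_end / dt))):
        act = [phi(beta[j] + gamma[j], theta) for j in range(n)]
        new = [gamma[i] + dt * (sum(K[i][j] * act[j] for j in range(n)) - delta * gamma[i])
               for i in range(n)]
        if any(new[i] < gamma[i] - 1e-12 for i in range(n)):
            monotone = False
        gamma = new
    return gamma, monotone


# Hypothetical two-stimulus case: a unit initially responding to S_1 only.
K = [[0.5, 0.4], [1.5, 0.3]]
beta = [1.0, 0.0]
gamma_star = minimal_solution(K, beta, theta=1.0, delta=1.0)
gamma_end, monotone = integrate(K, beta, theta=1.0, delta=1.0)
print(gamma_star, gamma_end, monotone)
```

In this example the second component crosses the threshold part of the way through the run, after which both components relax toward the larger equilibrium values, as in Figure (b).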


From this point on, we shall be concerned with the steady-state 
values $\gamma^{(i)*}$, and for brevity we shall drop the *. In the terminal 
condition, the A-unit $a_\mu^{(2)}$, whose history we have been following up 
to this point, is activated by $S_i$ if $\beta^{(i)} + \gamma^{(i)} \ge \theta$. The set of $A^{(2)}$ 
units which are activated by stimulus $S_i$ is denoted by $A^{(2)}(S_i)$. In 
the initial state, the set $A^{(2)}(S_i)$ is denoted by $A_0^{(2)}(S_i)$, and in the 
terminal state by $A_\infty^{(2)}(S_i)$. The expected fraction of $A^{(2)}$ units which 
are activated by both $S_i$ and $S_j$ will be $Q_{ij}^{(2)}$ and is equal to the expected 
number of units in $A^{(2)}(S_i) \cap A^{(2)}(S_j)$ divided by $N_a$. 

Once the $Q_{ij}^{(2)}$ are known, the behavior of the perceptron in its 
terminal (steady state) condition can be predicted. To determine these 
terminal values of $Q_{ij}^{(2)}$, we can proceed as follows. First, the set of 
$A^{(2)}$ units is broken into the smallest possible cells of the Venn diagram 
which represents the sets of units responding to different stimuli (cf. 
Fig. 43). For the units in each of these cells, there is a characteristic 
$\beta$-vector. For each such $\beta$-vector, we solve equation (16.12) for 
the terminal values of $\gamma^{(i)}$. Here we assume $f_{jk}$ to be given, and 
$Q_{ij}^{(1)}$ can be obtained from previous equations (as in Chapter 6). Initially, 
$Q_{ij}^{(2)} = Q_{ij}^{(1)}$. Knowing $\beta^{(i)}$ and $\gamma^{(i)}$, we can determine the 
region of the $A^{(2)}$ Venn diagram to which each cell of $A^{(2)}$ units moves. 
Thus we obtain the complete terminal distribution of A-units in the Venn 
diagram of $A^{(2)}$, and hence in particular the $Q_{ij}^{(2)}$. It can be seen 
that the motion will be for A-units to tend to go into higher-order intersections, 
but that points which are initially outside all the $A_0^{(2)}(S_i)$ will stay outside 
all the $A_\infty^{(2)}(S_i)$. 


16.2.2 A Numerical Example 


To clarify the above description, an illustrative example is 
worked out here numerically. Suppose there are three stimuli, $S_1$, $S_2$, 
and $S_3$, which initially activate sets of $A^{(1)}$ units (or sets of $A^{(2)}$ units, 
which will be equivalent under starting conditions) shown in the Venn diagram 
of Figure 43(a). Here the $Q^{(1)}$ matrix, and the initial value of the $Q^{(2)}$ 
matrix is 

$$Q^{(1)} = \begin{pmatrix} .3 & .1 & .1 \\ .1 & .4 & .3 \\ .1 & .3 & .6 \end{pmatrix}$$

Suppose the sequence $S_{j_0}, S_{j_1}, \ldots$ is [sequence illegible in source]. 
This is repeated over and over 
during the training, or "preconditioning" of the perceptron. Then the $f_{jk}$ 
matrix is 

$$f = \frac{1}{10} \begin{pmatrix} 0 & 3 & 1 \\ 2 & 1 & 1 \\ 2 & 0 & 0 \end{pmatrix}$$

and $C$, from the above analysis, is 

$$C = Q^{(1)} f = \begin{pmatrix} .3 & .1 & .1 \\ .1 & .4 & .3 \\ .1 & .3 & .6 \end{pmatrix} \frac{1}{10} \begin{pmatrix} 0 & 3 & 1 \\ 2 & 1 & 1 \\ 2 & 0 & 0 \end{pmatrix} = \begin{pmatrix} .04 & .10 & .04 \\ .14 & .07 & .05 \\ .18 & .06 & .04 \end{pmatrix}$$

Figure 43 (a) VENN DIAGRAM OF INITIAL $A^{(2)}$ SETS, FOR ILLUSTRATIVE EXAMPLE. 
10 A-UNITS, DISTRIBUTED AS SHOWN. 

Figure 43 (b) TERMINAL VENN DIAGRAM, FOR $\eta/\delta = \theta = 1$. 


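The equilibrium equations (16.12) then become 

$$\begin{pmatrix} \gamma^{(1)} \\ \gamma^{(2)} \\ \gamma^{(3)} \end{pmatrix} = \frac{\eta}{\delta} \begin{pmatrix} .4 & 1.0 & .4 \\ 1.4 & .7 & .5 \\ 1.8 & .6 & .4 \end{pmatrix} \begin{pmatrix} \phi(\beta^{(1)} + \gamma^{(1)}) \\ \phi(\beta^{(2)} + \gamma^{(2)}) \\ \phi(\beta^{(3)} + \gamma^{(3)}) \end{pmatrix} \tag{16.15}$$

Now we begin to trace the destinations of cells of the Venn diagram of 
Fig. 43(a). Start with the two A-units which are activated only by $S_1$. 
Here $\beta = \theta\,(1, 0, 0)$. The first iteration of (16.15) then gives 

$$\begin{pmatrix} \gamma^{(1)} \\ \gamma^{(2)} \\ \gamma^{(3)} \end{pmatrix} = \frac{\eta}{\delta} \begin{pmatrix} .4 \\ 1.4 \\ 1.8 \end{pmatrix}$$

If $\eta/\delta < \theta/1.8$, then $\gamma^{(2)}$ and $\gamma^{(3)}$ remain below $\theta$ (so that the 
corresponding $\phi$'s are zero), and the points in question 
stay in the same region of the Venn diagram. To be specific, 
let us take $\eta/\delta = \theta = 1$. Then we get for the first approximation 

$$\begin{pmatrix} \gamma^{(1)} \\ \gamma^{(2)} \\ \gamma^{(3)} \end{pmatrix} = \begin{pmatrix} .4 \\ 1.4 \\ 1.8 \end{pmatrix}$$

and for the second iteration, 

$$\begin{pmatrix} \gamma^{(1)} \\ \gamma^{(2)} \\ \gamma^{(3)} \end{pmatrix} = \begin{pmatrix} 1.8 \\ 2.6 \\ 2.8 \end{pmatrix}$$

which is the fixed point. The two associators in question have consequently 
moved into the triple intersection of the Venn diagram in Fig. 43. 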


Continuing in this fashion with each of the eight cells of the 
Venn diagram, we finally arrive at the terminal distribution shown in 
Fig. 43(b). For this we have the terminal Q-matrix: 

$$Q^{(2)} = \begin{pmatrix} .6 & .6 & .6 \\ .6 & .6 & .6 \\ .6 & .6 & .9 \end{pmatrix}$$

The stimuli $S_1$ and $S_2$ have become indistinguishable. The G-matrix for 
an $\alpha$-system is the same as $Q^{(2)}$, while for a $\gamma$-system, it would be 

$$\begin{pmatrix} .24 & .24 & .06 \\ .24 & .24 & .06 \\ .06 & .06 & .09 \end{pmatrix}$$

The "coagulation" of $S_1$ and $S_2$ corresponds to the fact that in the training 
sequence (which is reflected in the $f_{jk}$ matrix) $S_1$ and $S_2$ follow one 
another quite frequently, whereas they are very rarely followed by $S_3$. 
Consequently, $S_3$ tends to remain distinct, in the terminal G-matrix. 
In the following section, it will be seen that such behavior is quite 
characteristic of this system.* 

* Another numerical example will be found in Section 17.2, where the 
four-layer system is compared with an open-loop cross-coupled model. 
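The cell-by-cell procedure of this example can be retraced with exact rational arithmetic. The sketch below rests on several assumptions that are not stated explicitly in the text: the number of A-units in each Venn cell is inferred from the $Q^{(1)}$ matrix together with the statement that two units respond to $S_1$ alone, and the $\gamma$-system G-matrix is taken as $Q_{ij} - Q_{ii} Q_{jj}$, which reproduces the matrix quoted above. Note also that with these rounded coefficients the unit responding only to $S_2$ reaches $\gamma^{(1)} = \theta$ exactly, the borderline case excluded by the earlier assumption; the code follows the convention $\phi(x) = 1$ for $x \ge \theta$.

```python
from collections import Counter
from fractions import Fraction

THETA = Fraction(1)
RATIO = Fraction(1)          # eta / delta
N_A = 10                     # number of A-units in each layer

Q1_COUNTS = [[3, 1, 1], [1, 4, 3], [1, 3, 6]]   # N_a * Q^(1)
F_COUNTS = [[0, 3, 1], [2, 1, 1], [2, 0, 0]]    # 10 * f (pair counts)
N_PAIRS = 10

# N_a * C = (N_a * Q^(1)) f, the coefficient matrix of (16.15).
NA_C = [[sum(Fraction(Q1_COUNTS[i][k] * F_COUNTS[k][j], N_PAIRS) for k in range(3))
         for j in range(3)] for i in range(3)]

# Assumed initial cell populations: membership in (S1, S2, S3) -> unit count.
CELLS = {(1, 0, 0): 2, (0, 1, 0): 1, (0, 0, 1): 3,
         (0, 1, 1): 2, (1, 1, 1): 1, (0, 0, 0): 1}


def terminal_membership(cell):
    """Iterate (16.15) from gamma = 0 and return the terminal Venn cell."""
    beta = [THETA * m for m in cell]
    active = list(cell)      # with gamma = 0, phi(beta) equals the membership
    while True:
        gamma = [RATIO * sum(NA_C[i][j] * active[j] for j in range(3)) for i in range(3)]
        new_active = [1 if beta[i] + gamma[i] >= THETA else 0 for i in range(3)]
        if new_active == active:
            return tuple(active)
        active = new_active


terminal = Counter()
for cell, count in CELLS.items():
    terminal[terminal_membership(cell)] += count

# Terminal Q matrix: fraction of units responding to both S_i and S_j.
Q2 = [[Fraction(sum(c for cell, c in terminal.items() if cell[i] and cell[j]), N_A)
       for j in range(3)] for i in range(3)]
# Gamma-system G matrix under the assumption stated above.
G = [[Q2[i][j] - Q2[i][i] * Q2[j][j] for j in range(3)] for i in range(3)]

print([[float(x) for x in row] for row in Q2])
print([[float(x) for x in row] for row in G])
```

Under these assumptions six units end in the triple intersection, three remain in the region of $S_3$ alone, and one stays outside all three sets.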
16.3 Organization of Dichotomies 

The general analysis of the preceding section can be applied 
to a large variety of particular experimental designs. To begin with, we 
will show that with a suitable choice of parameters for the perceptron, and 
a suitable sequence of stimuli, a perceptron can spontaneously dichotomize 
an environment into any two classes, without any control of the reinforcement 
process by an external agency or experimenter. The organization of the 
stimulus sequence will determine the particular dichotomy which is formed.* 

Let the sequence of stimuli to which the perceptron is exposed 
be $S_{j_0}, S_{j_1}, \ldots, S_{j_M}$. In the following discussion, such a sequence 
will be called a "preconditioning sequence". Let $P_j$ denote the fraction 
of occurrences of $S_j$ in the given sequence, and let $P_{jk}$ denote the 
number of times $S_k$ immediately follows $S_j$ divided by the number of 
times $S_j$ occurs. Then $f_{jk} = \frac{M+1}{M} P_j P_{jk}$. With a sufficiently long 
sequence, $(M+1)/M \approx 1$, and the equilibrium equation takes the form: 

$$\gamma^{(i)} = \frac{\eta N_a}{\delta} \sum_{j=1}^{n} \sum_{k=1}^{n} Q_{ij}^{(1)} P_j P_{jk}\, \phi\big(\beta^{(k)} + \gamma^{(k)}\big) \tag{16.16}$$

where $P_j$ corresponds to the probability of $S_j$, and $P_{jk}$ corresponds to 
the transition probability $\text{Prob}\{S_j \to S_k\} = \text{Prob}\{S_j S_k \mid S_j\}$. 

* This can be interpreted as an R-controlled reinforcement system, 
although it does not actually depend on the outputs of the R-units in 
any essential way. 
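The relation between the pair frequencies $f_{jk}$ and the quantities $P_j$ and $P_{jk}$ can be confirmed on any finite sequence. The sketch below uses a short made-up sequence and a hypothetical helper name `pair_statistics`; neither comes from the text. Exact fractions are used so that the identity can be tested for equality.

```python
from collections import Counter
from fractions import Fraction


def pair_statistics(seq):
    """Return M, P_j, P_jk and f_jk for a stimulus sequence of M + 1 terms."""
    M = len(seq) - 1                       # number of successive pairs
    occurrences = Counter(seq)             # times each stimulus occurs
    pairs = Counter(zip(seq, seq[1:]))     # F_jk: times S_k follows S_j
    P = {j: Fraction(c, M + 1) for j, c in occurrences.items()}
    P_trans = {(j, k): Fraction(c, occurrences[j]) for (j, k), c in pairs.items()}
    f = {(j, k): Fraction(c, M) for (j, k), c in pairs.items()}
    return M, P, P_trans, f


# A made-up preconditioning sequence, for illustration only.
sequence = ["S1", "S2", "S1", "S2", "S2", "S3", "S1"]
M, P, P_trans, f = pair_statistics(sequence)

# f_jk = ((M + 1) / M) * P_j * P_jk holds exactly for every observed pair.
identity_holds = all(
    f[(j, k)] == Fraction(M + 1, M) * P[j] * P_trans[(j, k)] for (j, k) in f
)
print(M, identity_holds)
```

The factor $(M+1)/M$ arises only because a sequence of $M+1$ stimuli contains $M$ successive pairs; it tends to 1 as the sequence grows.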
EXPERIMENT 10: Take an environment, W, consisting of n stimuli, 
such that there is no appreciable difference in the retinal 
overlap of different pairs of stimuli. (With a large retina, a 
set of random dot stimuli will generally satisfy this condition.) 
Divide the stimuli arbitrarily into two classes, so that 
$S_1, S_2, \ldots, S_K$ are in Class X, while $S_{K+1}, \ldots, S_n$ are in 
Class Y. All members of a given class are equally likely to 
occur. Let the probability of transition to a member of the 
same class be $p$, nearly unity, and to a member of the 
opposite class be $1 - p$, nearly zero. Let the perceptron be 
exposed to an extended preconditioning sequence composed 
according to these probabilities, without any control by the 
r.c.s. At the end of the preconditioning sequence, the perceptron 
is exposed to a short additional sequence composed in the same 
manner, during which R-controlled reinforcement is administered, 
according to the rules of the $\gamma$-system, for A-unit to R-unit 
connections. The values of all connections are then "frozen", 
and the response of the perceptron to each stimulus in W 
is ascertained. 


It can be seen that this experiment is closely analogous to 
Experiment 9, in which the effects of R-controlled reinforcement were 
determined for an environment of horizontal and vertical bars, except for 
the preconditioning sequence (which would have no effect at all in a simple 
perceptron), and the additional condition that there is no way of determining 
whether two stimuli belong in the same or opposite classes on the basis of 
their retinal overlap. The only thing which characterizes two members of the 
same class differently from stimuli of opposite classes is the difference in 
transition probabilities in the preconditioning sequence. 

We assume $Q_{ij}^{(1)} = (q + \Delta\, \delta_{ij}) / N_a$, where $\Delta > 0$, $q \ge 0$, 
and $\delta_{ij}$ is the Kronecker delta. 
Thus the diagonal elements of the $Q^{(1)}$ matrix are all $(q + \Delta)/N_a$ and all 
other elements are $q/N_a$. (Note that by raising thresholds of the $A^{(1)}$ 
units, with a sufficient number of connections, the ratio $q/\Delta$ can be made 
as small as desired.) For the probabilities of stimulus-occurrence indicated 
in the experiment, we have 

$$P_j = \begin{cases} 1/2K & \text{for } S_j \text{ in } X \\ 1/2L & \text{for } S_j \text{ in } Y \end{cases} \qquad \text{where } L = n - K$$

$$P_{jk} = \begin{cases} p/K & \text{for } S_j \text{ in } X,\ S_k \text{ in } X \\ (1-p)/L & \text{for } S_j \text{ in } X,\ S_k \text{ in } Y \\ p/L & \text{for } S_j \text{ in } Y,\ S_k \text{ in } Y \\ (1-p)/K & \text{for } S_j \text{ in } Y,\ S_k \text{ in } X \end{cases}$$

Then we obtain from (16.16), 
$$\gamma^{(i)} = \frac{\eta}{\delta} \Big[ q \sum_{j=1}^{n} \sum_{k=1}^{n} P_j P_{jk}\, \phi\big(\beta^{(k)} + \gamma^{(k)}\big) + \Delta\, P_i \sum_{k=1}^{n} P_{ik}\, \phi\big(\beta^{(k)} + \gamma^{(k)}\big) \Big] \tag{16.17}$$

Let us now assume that $S_i$ is one of the stimuli of class X. 
Then 

$$\gamma^{(i)} = \frac{\eta}{\delta} \Big[ \frac{\Delta p + qK}{2K^2} \sum_{k=1}^{K} \phi\big(\beta^{(k)} + \gamma^{(k)}\big) + \frac{\Delta(1-p) + qK}{2KL} \sum_{k=K+1}^{n} \phi\big(\beta^{(k)} + \gamma^{(k)}\big) \Big] \tag{16.18}$$

We now observe the following: 

i) If $\dfrac{\eta(\Delta p + qK)}{2\delta K^2} \ge \theta$, then $A_\infty^{(2)}(S_i) \supseteq \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 

In words, if the stated inequality holds then, in the terminal 
condition, each of the stimuli of class X activates the union of all sets 
which were initially activated by any of the stimuli of class X. That is, 
each stimulus of a given class has "captured" all of the A-units that initially 
responded to all of the other stimuli of that class. The proof follows from 
the fact that any $A^{(2)}$ unit which originally responded to any of the 
stimuli in class X contributes a non-zero term in $\sum_{k=1}^{K}$ in (16.18). 
The postulated inequality then guarantees that the A-unit will be active in 
the terminal state. 


ii) If $\dfrac{\eta\,[\Delta(1-p) + qK]}{2\delta K} < \theta$, then $A_\infty^{(2)}(S_i) \subseteq \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 

In words, if the stated inequality holds then, in the terminal 
condition, no stimulus of class X activates any A-unit outside of the union 
of sets initially activated by stimuli of class X. The proof follows from 
the fact that, if we were to solve (16.18) by iteration, then any A-unit 
which is activated by none of the X-stimuli has, on the first iteration, no 
contribution from $\sum_{k=1}^{K}$. In virtue of the assumed inequality it will not 
have any contribution on any following iteration either, and $\alpha^{(i)}$ remains less 
than $\theta$. Since only a finite number of iterations are involved, this unit 
does not become active. 

iii) If the inequalities of (i) and (ii) both hold, then 
$A_\infty^{(2)}(S_i) = \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 

Necessary and sufficient conditions for both (i) and (ii) to 
hold have been found by H. D. Block. They are 

a) $\Delta > qK(K-1)$ 
b) $p > [K\Delta + qK(K-1)] \,/\, [\Delta(K+1)]$ 
c) $K^2/(\Delta p + qK) \le \eta/2\delta\theta < K/(\Delta(1-p) + qK)$ 

Condition a) insures that a probability $p$ $(0 < p < 1)$ can be chosen to 
satisfy b). Condition b) insures that $\eta/2\delta\theta$ can be chosen to satisfy c). 

The conditions can be written in the alternative form 

a') $p > K/(K+1)$ 
b') $\Delta > qK(K-1)/[p(K+1) - K]$ 
c') as above. 
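Conditions a) to c) and their alternative form can be checked for a particular choice of parameters. The values $K = 4$, $q = 1$, $\Delta = 100$, $p = 0.9$, the normalization $\theta = 1$, and the function name `block_conditions` are hypothetical choices made for this sketch; $q$ and $\Delta$ are expressed as numbers of A-units, i.e. $N_a Q_{ij}^{(1)} = q + \Delta\,\delta_{ij}$.

```python
from fractions import Fraction


def block_conditions(K, q, Delta, p):
    """Evaluate conditions a) and b) and the admissible interval of c).

    Returns (a_holds, b_holds, lower, upper), where c) requires
    lower <= eta / (2 * delta * theta) < upper.
    """
    a_holds = Delta > q * K * (K - 1)
    b_holds = p > Fraction(K * Delta + q * K * (K - 1), Delta * (K + 1))
    lower = Fraction(K * K) / (Delta * p + q * K)
    upper = Fraction(K) / (Delta * (1 - p) + q * K)
    return a_holds, b_holds, lower, upper


K, q, Delta, p = 4, 1, 100, Fraction(9, 10)
a_holds, b_holds, lower, upper = block_conditions(K, q, Delta, p)

# Alternative form a'), b').
a_alt = p > Fraction(K, K + 1)
b_alt = Delta * (p * (K + 1) - K) > q * K * (K - 1)

# Pick eta / (2 * delta * theta) inside the interval and test (i) and (ii) with theta = 1.
theta = 1
ratio = Fraction(1, 5)                      # eta / (2 * delta * theta)
ineq_i = ratio * theta * (Delta * p + q * K) / (K * K) >= theta
ineq_ii = ratio * theta * (Delta * (1 - p) + q * K) / K < theta

print(a_holds, b_holds, float(lower), float(upper), a_alt, b_alt, ineq_i, ineq_ii)
```

For these values the admissible interval for $\eta/2\delta\theta$ is roughly 0.17 to 0.29, and a ratio of 0.2 satisfies both inequality (i) and inequality (ii).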


Under the conditions indicated, if Experiment 10 is completed 
by exposing the perceptron to a continuation of the same stimulus sequence 
with R-controlled $\gamma$-reinforcement, the first response to occur will 
immediately generalize to all stimuli of the same class as the one which 
evoked the response, since each member of the class activates the identical 
set of A-units, after the preconditioning sequence. Suppose a member of 
class X is the first stimulus to occur, and that this happens to evoke the 
response $r^* = +1$. Then this response will be reinforced, and will 
generalize immediately to all other members of class X. However, 
under the conditions assumed above, the intersections between the sets of 
A-units initially responding to stimuli of class X and stimuli of class Y 
were all equal to $q$, and it was noted that by using large thresholds, $q$ 
could be made arbitrarily small relative to the measure of the responding 
A-sets. If each A-unit has a large number of distinct origin points (no two 
identical) $q$ can, in fact, be made small relative to the product $Q_i Q_j$. 
Thus, with a large threshold, in a $\gamma$-system, the generalization coefficient 
$g_{ij}$ for $S_i$ in X and $S_j$ in Y will be negative. Consequently, any 
stimuli of class Y will automatically be assigned the opposite response 
from stimuli of class X. Thus a completely consistent dichotomy has been 
created, from the time the first stimulus of the terminal training sequence 
occurs. Further reinforcement will only strengthen the tendencies thus 
established. 

If the ratio $\eta/\delta$ is made large enough, the perceptron in 
Experiment 10 will ultimately arrive at a state in which every stimulus 
activates all A-units which ever responded to any stimulus of either class. 
However, in practice, the constraints on the parameters need not be as 
severe as those indicated in conditions a), b), and c) above, in order to 
obtain useful generalization effects from the system. As long as $\eta/\delta$ 
is not so large as to cause a complete merging of all A-sets for all stimuli, 
it remains possible to teach the "preconditioned" perceptron to discriminate 
all stimuli of the two classes correctly with a single corrective reinforcement 
for one stimulus of each class, as long as the inequality 
$\dfrac{\eta(\Delta p + qK)}{2\delta K^2} \ge \theta$ is satisfied. 


16.4 Organization of Multiple Classes 

Suppose we have the same kind of environment as in 
Experiment 10, but that the stimuli are considered to be of, say, three 
classes: 
$A_1, A_2, \ldots, A_K$; $B_1, B_2, \ldots, B_L$; $C_1, C_2, \ldots, C_M$ $(K + L + M = n)$. 
We assume there is not too much overlap between the different types of 
stimuli, an assumption which will be made more precise below. (As in 
the previous case, the overlap can always be reduced as far as required 
by making $\theta$ sufficiently high.) The three classes will be called X, Y, 
and Z. We assume that the $Q_{ij}^{(1)}$ matrix is 

$$Q_{ij}^{(1)} = \begin{cases} q/N_a & \text{if } S_i \text{ and } S_j \text{ are in different classes} \\ (q + r)/N_a & \text{if } S_i \text{ and } S_j \text{ are in the same class, } S_i \ne S_j \\ (q + r + \Delta)/N_a & \text{if } S_i = S_j \end{cases}$$

From the nature of a $Q_{ij}$ matrix it is necessary that $q \ge 0$, $(q + r) \ge 0$, 
and $(r + \Delta) > 0$. We assume $\Delta > 0$. 

Suppose that the transition probabilities are large ($p$) for 
transitions to a member of the same class, and small, $(1 - p)/2$, to each 
of the other classes. Within a class each transition is equally likely. Then 

$$P_{jk} = \begin{cases} p/K & S_j \text{ in } X,\ S_k \text{ in } X \\ p/L & S_j \text{ in } Y,\ S_k \text{ in } Y \\ p/M & S_j \text{ in } Z,\ S_k \text{ in } Z \\ (1-p)/2L & S_j \text{ in } X,\ S_k \text{ in } Y; \text{ or } S_j \text{ in } Z,\ S_k \text{ in } Y \\ (1-p)/2M & S_j \text{ in } X,\ S_k \text{ in } Z; \text{ or } S_j \text{ in } Y,\ S_k \text{ in } Z \\ (1-p)/2K & S_j \text{ in } Y,\ S_k \text{ in } X; \text{ or } S_j \text{ in } Z,\ S_k \text{ in } X \end{cases}$$
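The probabilities of occurrence of individual stimuli are given by 

$$P_j = \begin{cases} 1/3K & S_j \text{ in } X \\ 1/3L & S_j \text{ in } Y \\ 1/3M & S_j \text{ in } Z \end{cases}$$

Then Equation (16.16) becomes 

$$\gamma^{(i)} = \frac{\eta}{\delta} \Big[ \sum_{j \in X} \sum_{k \in X} + \sum_{j \in X} \sum_{k \in Y} + \sum_{j \in X} \sum_{k \in Z} + \sum_{j \in Y} \sum_{k \in X} + \sum_{j \in Y} \sum_{k \in Y} + \sum_{j \in Y} \sum_{k \in Z} + \sum_{j \in Z} \sum_{k \in X} + \sum_{j \in Z} \sum_{k \in Y} + \sum_{j \in Z} \sum_{k \in Z} \Big]\, N_a Q_{ij}^{(1)} P_j P_{jk}\, \phi\big(\beta^{(k)} + \gamma^{(k)}\big) \tag{16.19}$$

where, for simplicity of notation, X, Y, and Z have been used for the 
appropriate index sets. Suppose $i$ is in X (i.e., $S_i$ is in X). Then 
(16.19) yields 

$$\gamma^{(i)} = \frac{\eta}{\delta} \Big\{ \frac{p(Kr + \Delta) + qK}{3K^2} \sum_{k \in X} \phi\big(\beta^{(k)} + \gamma^{(k)}\big) + \frac{(1-p)(Kr + \Delta) + 2qK}{6K} \Big[ \frac{1}{L} \sum_{k \in Y} \phi\big(\beta^{(k)} + \gamma^{(k)}\big) + \frac{1}{M} \sum_{k \in Z} \phi\big(\beta^{(k)} + \gamma^{(k)}\big) \Big] \Big\} \tag{16.20}$$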


We can now assert the following: 

i) If $\dfrac{\eta\,[p(Kr + \Delta) + qK]}{3\delta K^2} \ge \theta$, then the set $A_\infty^{(2)}(S_i) \supseteq \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 

That is, if the stated inequality holds, then every $A^{(2)}$ unit which initially 
responded to any stimulus in class X now responds to each stimulus in 
class X. This is readily proven by noting that for any $A^{(2)}$ unit which 
initially responds to any of the stimuli in class X there is at least one 
non-zero $\phi$ in $\sum_{k \in X}$ in (16.20). The postulated inequality then guarantees 
that $\gamma^{(i)} \ge \theta$ for any $i$ such that $S_i$ is in X. 


ii) If $\dfrac{\eta\,[(1-p)(Kr + \Delta) + 2qK]}{6\delta K} < \theta$, then $A_\infty^{(2)}(S_i) \subseteq \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 


That is, if the stated inequality holds, then every $A^{(2)}$ unit which did not 
initially respond to at least one of the stimuli of class X does not respond 
to any class X stimulus in the terminal state. This is proven as follows. 
For an A-unit which does not respond to any stimulus of class X, none of 
the terms in $\sum_{k \in X}$ in (16.20) are present on the first iteration, which 
starts with $\gamma_0^{(i)} = 0$. The stated inequality guarantees that, even if all the 
other terms are present, no $\gamma^{(i)}$ for $S_i$ in X will reach $\theta$. Thus no terms 
in $\sum_{k \in X}$ will ever be non-zero. 

iii) If both of the above inequalities hold, then $A_\infty^{(2)}(S_i) = \bigcup_{S_j \in X} A_0^{(2)}(S_j)$. 

That is, each stimulus in class X activates exactly the same set of $A^{(2)}$ 
units in the terminal state; and that set consists of just those A-units which 
originally were activated by any one of the stimuli of class X. 
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Necessary and sufficient conditions that the inequalities of both 


i) and ii) be satisfied have again been derived by Block, and are 
a) r > -4/K 
b) 9 < (Kr+ a) / K(k - 1) 
c) p> [2gK(k-1) + K(Kr+a)] / (Ke + a)(K+2) 


d) 3K 7/[p(kr+a)e 9k| < n/eo < 6K/[(1- p)ikr+a)+ 29k | 


Condition a) guarantees that a suitable $g > 0$ can be chosen
in b); Condition b) guarantees that a suitable $p < 1$ can be chosen in c);
Condition c) guarantees that an $\eta/\delta$ can be chosen to satisfy d).


If the parameters are suitably set we have seen that the response
in the $A^{(2)}$ layer to any stimulus in class X is $\bigcup_{S_j \in X} A_0^{(2)}(S_j)$. Similarly
for classes Y and Z. This means that a $\gamma$-system perceptron with a
single R-unit will tend to assign the same response to all members of the
first class of stimuli to be represented in the training sequence. All other
stimuli will receive the opposite response, if the initial intersections of
responding A-sets are small enough. With more than one R-unit and inhibitory
connections between the R-units, so that only one can go on at one
time (cf. Chapter 20), it is thus possible for the perceptron to assign a
unique response to each stimulus class. If there is too much initial overlap
between the responding sets of A-units, or if condition i) is satisfied
without condition ii) being satisfied, a single corrective reinforcement applied
for any one stimulus of each class may still be sufficient to yield the correct
response for all stimuli in the environment.
16.5 Similarity Generalization


In the experiments considered above, the nature of the stimulus
classes was never explicitly stated. Clearly, they could have been
similarity classes, under a suitably chosen similarity relation, and the same
results would have been obtained. In order to obtain generalization over the
entire class, however, it was assumed that "runs" of stimuli from each class
occurred, it being much more likely that a stimulus was followed by another
member of the same class than by a stimulus from a different class. After a
long preconditioning sequence of this type, it might be expected that the
perceptron would have seen each stimulus in the environment a great number
of times. We now consider the generalization of a similarity relation to
stimuli which have not occurred during the preconditioning sequence.*


EXPERIMENT 11: Consider an environment of stimuli $S_1, S_2, \ldots, S_n$
        and their transforms $T(S_1), T(S_2), \ldots, T(S_n)$, where $T$ is
        any transformation in which the measure of fixed points is zero.
        Let the perceptron be exposed to a preconditioning sequence,
        consisting of stimuli followed by their transforms, i.e., a
        sequence of the form $\{S_{\alpha_1}, T(S_{\alpha_1}), S_{\alpha_2}, T(S_{\alpha_2}), \ldots\}$,
        where the subscripts $\alpha_1, \alpha_2, \ldots$ are picked at random
        from the set of integers 1 through $n$. Now consider a pair of
        test stimuli, $S_x$ and $S_y$, and their transforms $T(S_x)$ and
        $T(S_y)$, none of which occurred during the preconditioning
        sequence. Let one response be associated to $S_x$ and the
        opposite response to $S_y$, by means of an error correction
        procedure. Now test the perceptron to determine its response
        to $T(S_x)$ and $T(S_y)$.

* This is directly analogous to the phenomenon of similarity generalization
originally predicted for cross-coupled systems in Rosenblatt, Ref. 85.
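
The preconditioning sequence of Experiment 11 can be sketched in a few lines. The following is only an illustration under stated assumptions: stimuli are represented as sets of (row, column) retinal points, and the transform `shift_right`, the toy stimuli, and the helper name `preconditioning_sequence` are all hypothetical choices, not part of the original experiment.

```python
import random

def shift_right(stimulus, offset):
    """Hypothetical transform T: move every retinal point `offset` columns right."""
    return frozenset((row, col + offset) for row, col in stimulus)

def preconditioning_sequence(stimuli, transform, pairs, rng):
    """Return [S_a1, T(S_a1), S_a2, T(S_a2), ...] with indices drawn at random."""
    sequence = []
    for _ in range(pairs):
        s = rng.choice(stimuli)
        sequence.append(s)             # the stimulus itself
        sequence.append(transform(s))  # immediately followed by its transform
    return sequence

rng = random.Random(0)
# Three toy stimuli confined to columns 0-3 of an 8-column retina (assumed layout).
stimuli = [
    frozenset({(0, 0), (1, 1)}),
    frozenset({(0, 2), (2, 3)}),
    frozenset({(1, 0), (2, 2)}),
]
T = lambda s: shift_right(s, 4)  # moves a stimulus to the right half of the field
seq = preconditioning_sequence(stimuli, T, pairs=5, rng=rng)
```

Because the shift moves every point out of the left half of the field, no stimulus shares a point with its own transform, which is one simple way of meeting the requirement that the transformation have no fixed points.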
It is predicted that if this experiment is performed with random
dot stimuli in the preconditioning sequence, with a finite retina, and $S_x$
and $S_y$ are any other stimuli (e.g., a square and a triangle, or two letters
of the alphabet), the transforms $T(S_x)$ and $T(S_y)$ will each tend to activate
the appropriate response, which was associated to $S_x$ and $S_y$, respectively.
In other words, the perceptron will have learned that any two stimuli which
are similar under the transformation $T$ are to be treated as equivalent,
even though the stimuli have never been seen before.


To begin with, we consider the following problem, which is
essentially a special case of Experiment 11, performed with only a single
test stimulus.


Consider the stimuli $S_1, S_2, \ldots, S_K$ and their transforms
$S_{K+1} = T(S_1)$, $S_{K+2} = T(S_2)$, ..., $S_{2K} = T(S_K)$. For example,
$S_1, \ldots, S_K$ may be in the left half of the field, and $T$ a transformation
which moves them to the right half of the field. $S_x$ ($x = 2K+1$) is not
shown during the preconditioning sequence, but is a test stimulus to be
applied later. $S_{x'} = T(S_x)$, $x' = 2K+2 = n$. Let us assume $S_x$
intersects $S_1, \ldots, S_L$ ($L < K$) to a larger extent than it does the others
and hence $S_{x'}$ intersects mainly the stimuli $S_{K+1}, \ldots, S_{K+L}$. These
relationships are illustrated in Figure 44.
Figure 44 RELATIONSHIP OF TEST STIMULUS TO PRECONDITIONING STIMULI
AND TRANSFORMS


Specifically, consider the conditions 


Q (1) (9+ 40z;)/No yore 
oe (9+ 7)/No fet 
(1) #/No oe 
Ox; = 4 (q+ r)/Ny K+elsjeKel 
(g+4d,1/N, J>Ke#el 


In the preconditioning sequence, a stimulus $S_j$ is picked at
random from $S_1, \ldots, S_K$, and this is followed by its transform, $T(S_j)$.
Then another stimulus is picked at random from $S_1, \ldots, S_K$ and this is
followed by its transform, and so on. Then

$$P_{jk} = \begin{cases} 0, & j > 2K \\ 1, & j \le K,\ k = K + j \\ 1/K, & j > K,\ k \le K \\ 0, & \text{otherwise} \end{cases}$$


We also specify that no A-unit is activated by more than $\mu$ ($\mu < K/L$) of
the stimuli $S_1, S_2, \ldots, S_K$.
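
As a check on the transition probabilities just defined, the matrix can be built explicitly. This is a minimal sketch assuming the reading $P_{jk} = 1$ for $j \le K$, $k = K+j$; $P_{jk} = 1/K$ for $K < j \le 2K$, $k \le K$; and zero otherwise. The function name `transition_matrix` and the choice $K = 3$ are illustrative only.

```python
def transition_matrix(K, n_total):
    """P[j][k] for stimuli numbered 1..n_total (row and column 0 are unused)."""
    P = [[0.0] * (n_total + 1) for _ in range(n_total + 1)]
    for j in range(1, n_total + 1):
        for k in range(1, n_total + 1):
            if j <= K and k == K + j:
                P[j][k] = 1.0        # S_j is always followed by T(S_j)
            elif K < j <= 2 * K and k <= K:
                P[j][k] = 1.0 / K    # a transform is followed by a random S_k
    return P

K = 3
P = transition_matrix(K, 2 * K + 2)  # includes the test stimuli 2K+1 and 2K+2
row_sums = [sum(P[j][1:]) for j in range(1, 2 * K + 3)]
```

Every row belonging to a preconditioning stimulus sums to one, while the rows for the two test stimuli are identically zero, reflecting the fact that they never occur in the sequence.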


From Equation (16.16) we obtain 


kK 
(x) Na” (1) (Key) (key) (1) Le 
Ge 2, 5 0(2 +e }ta ap ¥ a, o(3" (16.21) 
yzl =K+! = 
(x) K 7, 7" ') < ( y zi" 
x" iG arn ola ty ve oy Ae" Ly K+ +9 Daas dv: a 
yzltl yal 


Hence we have the following results: (16.22) 


i) if (grr)/2Ko >@ , then Ay (Sy) DAL(S,) + U Ag (TS). 


In words, if the stated inequality holds, then, in the terminal state, $S_x$
activates all those elements originally activated either by itself or by any
of the transforms $T(S_1), \ldots, T(S_L)$.


cs (2) (2) (2) 
ii) If EE (K+ u-L) <@ ,then Ay (S,)&A, Sera AS). 


2 
iii) If both inequalities hold, then A = A's y) + U A? (r(s-)) : 
JVeL 
Thus far, we have considered the generalization of a response
from $S_x$ to the transforms $T(S_1)$, $T(S_2)$, etc. Suppose a response
is associated to $T(S_x)$; we are then interested in determining whether
there is any generalization in the reverse direction, i.e., to $S_x$.
We can obtain $\gamma^{(x')}$ from Equation (16.21), with $x$ replaced by $x'$,
which yields:


fs =o Yr E o(a rs gd) =e Kq g(a + *) 
rr 


Consequently, 
iv) If oS E (K+) + se <Q, then AS) z A's.) . 


If inequalities i), ii), and iv) all hold, then the stimulus $S_x$ generalizes to
$T(S_1), \ldots, T(S_L)$, but the transform $T(S_x) = S_{x'}$ does not generalize
to the stimulus $S_x$. Necessary and sufficient conditions that all three
inequalities hold are easily found (with $r \ge 0$, iv) implies ii)):


a) ro> gk (Kt -1) 


K-Ly 
2 
b) 2K Z ” g 2k 
g+r 60 K(K+tu)gtlry 
In particular, let $L = 1$. Then $A_\infty^{(2)}(S_x) = A_0^{(2)}(S_x) + A_0^{(2)}(T(S_1))$.
Thus, due to the intersection between $A_0^{(2)}(S_{x'})$ and $A_0^{(2)}(T(S_1))$,
the test stimulus generalizes to its transform, even though neither the test
stimulus nor its transform has occurred during the preconditioning sequence.
Under these conditions, the perceptron will behave in much the same manner
as the specially constrained similarity-biased perceptron of Chapter 15. The
actual magnitude of the bias thus induced, in a simple discrimination
experiment, can be calculated as follows.
Let $S_3$ be another test stimulus, like $S_x$, but its chief
intersection is with $S_2$, say also $q + r$. Then if conditions a) and b) are
satisfied (with $L = 1$), $A_\infty^{(2)}(S_3) = A_0^{(2)}(S_3) + A_0^{(2)}(T(S_2))$ and
$A_\infty^{(2)}(T(S_3)) = A_0^{(2)}(T(S_3))$. Suppose the perceptron has zero
initial values on the $A^{(2)}$ to R-unit connections. Let $S_x$ be shown, and all
active A-R connections reinforced by $+1$. Then let $S_3$ be shown, and all
active A-R connections reinforced by $-1$. Now if the perceptron is shown
$T(S_x)$ (which it has never seen before) the input to the R-unit is equal to
the number of A-units in $A_0^{(2)}(T(S_x)) \cap [A_0^{(2)}(T(S_1)) \cup A_0^{(2)}(S_x)]$
minus the number of A-units in $A_0^{(2)}(T(S_x)) \cap [A_0^{(2)}(T(S_2)) \cup A_0^{(2)}(S_3)]$,
which in general is positive; while if it is shown $T(S_3)$ the signal to the
R-unit is negative. Thus the discrimination which was taught for $S_x$ and
$S_3$ carries over to $T(S_x)$ and $T(S_3)$.
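
The counting argument for the R-unit input can be made concrete with a toy example. In the sketch below the sets of $A^{(2)}$ unit indices are invented purely for illustration; the only structural assumption carried over from the text is that the set for $T(S_x)$ overlaps the set for $T(S_1)$, and the set for $T(S_3)$ overlaps the set for $T(S_2)$.

```python
def r_unit_input(active, plus_set, minus_set):
    """Net R-unit signal: +1 per active unit reinforced positively, -1 per unit reinforced negatively."""
    return len(active & plus_set) - len(active & minus_set)

# Hypothetical initial response sets of A(2) units (indices are arbitrary labels).
A0 = {
    "Sx":    {0, 1, 2, 3},
    "T(S1)": {4, 5, 6, 7},
    "T(Sx)": {6, 7, 8, 9},        # overlaps T(S1), as assumed in the text
    "S3":    {10, 11, 12, 13},
    "T(S2)": {14, 15, 16, 17},
    "T(S3)": {16, 17, 18, 19},    # overlaps T(S2)
}

# Terminal response to S_x is A0(S_x) + A0(T(S_1)); these units receive +1.
plus = A0["Sx"] | A0["T(S1)"]
# Terminal response to S_3 is A0(S_3) + A0(T(S_2)); these units receive -1.
minus = A0["S3"] | A0["T(S2)"]

input_TSx = r_unit_input(A0["T(Sx)"], plus, minus)
input_TS3 = r_unit_input(A0["T(S3)"], plus, minus)
```

With these invented sets the signal for $T(S_x)$ is positive and the signal for $T(S_3)$ is negative, so the taught discrimination carries over to the transforms.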


In the above analysis, it was postulated that the test stimuli
should have larger intersections with some of the preconditioning stimuli
than with others. This assumption is crucial for the predicted effect to
occur. The reader will recall from the discussion of the last chapter that
in a perceptron with an infinite retina, no similarity bias could be obtained
between random stimuli because the distribution of their intersections had
zero variance. The same situation holds here. If the preconditioning stimuli
are random dot patterns, and the retina is infinite, then every preconditioning
stimulus will have exactly the same intersection with the test stimulus $S_x$,
and the required bias cannot occur. In a finite retina, however, the
intersections will be binomially distributed (as in the analysis of Chapter 15), and
the predicted effect will be obtained.
We also note an advantage, as before, if compact, coherent
stimuli are employed for preconditioning and as test stimuli. In this case,
even in an infinite retina, the distribution of intersections will have non-zero
variance, and the test stimulus will tend to be more closely related to some
preconditioning stimuli than to others. As long as two test stimuli, $S_x$ and
$S_3$, do not intersect the same sets of preconditioning stimuli to the same
degree, they can be discriminated in the terminal state of the system (provided
the required parametric conditions are satisfied), but each will generalize to
its transform. Thus the claim made for the performance of such a system in
Experiment 11 has been verified in principle. Quantitative studies of actual
cases are not yet complete, but similar experiments with cross-coupled
systems (to be presented in Chapter 19) suggest that highly satisfactory results
can, in fact, be obtained in practice.


The asymmetrical generalization from $S$ to $T(S)$, but not
from $T(S)$ to $S$, can, of course, be overcome by employing a symmetrical
preconditioning sequence, in which a stimulus is as likely to be followed by
the inverse transformation, $T^{-1}(S)$, as by $T(S)$.


For instance, take $\{S_1, \ldots, S_n\} = \{S_1, \ldots, S_K, T(S_1), \ldots, T(S_K)\}$,
where $K = n/2$. Let

$$Q^{(1)}_{ij} = (q + \Delta\,\delta_{ij})/N_a$$

$$P_j = 1/2K$$

$$P_{jk} = \begin{cases} p, & j \le K,\ k = K + j \\ p, & j > K,\ k = j - K \\ (1-p)/(2K-1), & j \le K,\ k \ne K + j \\ (1-p)/(2K-1), & j > K,\ k \ne j - K \end{cases}$$


Let $\mu = p - (1-p)/(2K-1)$; then the $P_{jk}$ can be
expressed as follows. For $1 \le j \le K$, $1 \le k \le K$, we have

$$P_{jk} = P_{j+K,\,k+K} = r$$

$$P_{j,\,k+K} = P_{j+K,\,k} = r + \mu\,\delta_{jk}$$

where $r = (1 - \mu)/2K = (1 - p)/(2K - 1)$. This means that the transition
probability from a stimulus to its transform, or vice versa, is $r + \mu$,
while for any two unrelated stimuli, the transition probability is $r$.
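
The two descriptions of the symmetrical transition probabilities can be checked against one another numerically. The sketch below assumes the relations $\mu = p - (1-p)/(2K-1)$ and $r = (1-\mu)/2K$; the function name `symmetric_transitions` and the values $K = 3$, $p = 0.7$ are illustrative assumptions.

```python
def symmetric_transitions(K, p):
    """Build the 2K x 2K matrix P[j][k] = r + mu * [k is the transform partner of j]."""
    n = 2 * K
    mu = p - (1.0 - p) / (n - 1)
    r = (1.0 - mu) / n
    P = [[r] * n for _ in range(n)]
    for j in range(K):
        P[j][j + K] = r + mu   # stimulus -> its transform
        P[j + K][j] = r + mu   # transform -> its original stimulus
    return P, r, mu

K, p = 3, 0.7
P, r, mu = symmetric_transitions(K, p)
row_sums = [sum(row) for row in P]
```

For these values $r = 0.06$ and $r + \mu = 0.7 = p$, and every row of the matrix sums to one, as a set of transition probabilities must.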


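Then from (16.16) we have

$$\gamma^{(i)} = \frac{N_a \eta}{2K\delta} \Big[ \sum_{j=1}^{K}\sum_{k=1}^{K} Q^{(1)}_{ij} P_{jk}\,\phi(\alpha^{(k)}) + \sum_{j=1}^{K}\sum_{k=1}^{K} Q^{(1)}_{ij} P_{j,k+K}\,\phi(\alpha^{(k+K)})$$

$$\qquad + \sum_{j=1}^{K}\sum_{k=1}^{K} Q^{(1)}_{i,j+K} P_{j+K,k}\,\phi(\alpha^{(k)}) + \sum_{j=1}^{K}\sum_{k=1}^{K} Q^{(1)}_{i,j+K} P_{j+K,k+K}\,\phi(\alpha^{(k+K)}) \Big]$$


Assuming $S_i \in \{S_1, \ldots, S_K\}$ we have

$$\gamma^{(i)} = \frac{\eta}{2K\delta} \sum_{j=1}^{K}\sum_{k=1}^{K} \Big\{ (q + \Delta\delta_{ij}) \big( r\,\phi(\alpha^{(k)}) + (r + \mu\delta_{jk})\,\phi(\alpha^{(k+K)}) \big) + q \big( (r + \mu\delta_{jk})\,\phi(\alpha^{(k)}) + r\,\phi(\alpha^{(k+K)}) \big) \Big\}$$

$$= \frac{\eta}{2K\delta} \sum_{k=1}^{K} \Big\{ 2Kqr \big[ \phi(\alpha^{(k)}) + \phi(\alpha^{(k+K)}) \big] + q\mu \big[ \phi(\alpha^{(k)}) + \phi(\alpha^{(k+K)}) \big] + \Delta \big( r\,\phi(\alpha^{(k)}) + (r + \mu\delta_{ik})\,\phi(\alpha^{(k+K)}) \big) \Big\}$$

$$= \frac{\eta}{2K\delta} (2Kqr + q\mu + \Delta r) \sum_{k=1}^{K} \big[ \phi(\alpha^{(k)}) + \phi(\alpha^{(k+K)}) \big] + \frac{\eta}{2K\delta}\,\Delta\mu\,\phi(\alpha^{(i+K)})$$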


Thus if $p$ (or $\mu$) is nearly 1 and $\Delta/q$ is large, $S_i$ will
generalize to its transform, and conversely $T(S_i)$ will generalize to $S_i$,
since

$$\gamma^{(i+K)} = \frac{\eta}{2K\delta} (2Kqr + q\mu + \Delta r) \sum_{k=1}^{K} \big[ \phi(\alpha^{(k)}) + \phi(\alpha^{(k+K)}) \big] + \frac{\eta}{2K\delta}\,\Delta\mu\,\phi(\alpha^{(i)})$$


To get the specific form of the conditions for such generalization to occur,
we extract the term for $k = i$ in $\sum_{k=1}^{K}$ and put it with the second term. This
gives the first required inequality,

$$(\eta/2K\delta)(2Kqr + q\mu + \Delta r + \Delta\mu) > \theta$$
or, replacing $r$ and $\mu$ in terms of $p$, and $2K$ by $n$, we get the
condition


i) If $\eta(q + \Delta p)/\delta n > \theta$, then $A_\infty^{(2)}(S_i) \supseteq A_0^{(2)}(S_i) + A_0^{(2)}(T(S_i))$.

The second required inequality turns out to be

$$(\eta(K-1)/2K\delta)(2Kqr + q\mu + \Delta r) < \theta$$

or, replacing $r$ and $\mu$ in terms of $p$, we get


ii) If $\eta(n-2)[q(n-1) + \Delta(1-p)]/2n(n-1)\delta < \theta$, then $A_\infty^{(2)}(S_i) \subseteq A_0^{(2)}(S_i) + A_0^{(2)}(T(S_i))$.


iii) If both inequalities hold, then $A_\infty^{(2)}(S_i) = A_0^{(2)}(S_i) + A_0^{(2)}(T(S_i))$.


Necessary and sufficient conditions that both inequalities hold, given $n > 4$,
are
a) $p > (n-2)/(3n-4)$
b) $q < \Delta[p(3n-4) - n + 2]/(n-1)(n-4)$
c) $\eta/\delta\theta$ must be so chosen as to satisfy i) and ii).


For $n = 4$, these conditions are satisfied if $p > 1/4$ and

$$4/(q + \Delta p) < \eta/\delta\theta < 12/[3q + \Delta(1-p)]$$
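
The admissible interval for $\eta/\delta\theta$ is easily tabulated. The sketch below assumes that inequality i) reads $\eta(q + \Delta p)/\delta n > \theta$ and inequality ii) reads $\eta(n-2)[q(n-1) + \Delta(1-p)]/2n(n-1)\delta < \theta$; the function names and the numerical values of $q$, $\Delta$ and $p$ are hypothetical.

```python
def eta_bounds(n, q, Delta, p):
    """Lower and upper bounds on eta/(delta*theta) implied by inequalities i) and ii)."""
    lower = n / (q + Delta * p)
    upper = 2.0 * n * (n - 1) / ((n - 2) * (q * (n - 1) + Delta * (1.0 - p)))
    return lower, upper

def interval_exists(n, q, Delta, p):
    """True if some eta/(delta*theta) satisfies both inequalities at once."""
    lower, upper = eta_bounds(n, q, Delta, p)
    return lower < upper

# Special case n = 4: the bounds reduce to 4/(q + Delta*p) and 12/(3q + Delta*(1-p)).
low4, high4 = eta_bounds(4, q=1.0, Delta=10.0, p=0.5)

# For n = 6 and Delta = 10, p = 0.5, condition b) requires q < 3.
ok_small_q = interval_exists(6, q=2.0, Delta=10.0, p=0.5)
ok_large_q = interval_exists(6, q=4.0, Delta=10.0, p=0.5)
```

With $n = 6$ the interval is non-empty for $q = 2$ but empty for $q = 4$, in agreement with condition b), and with $n = 4$ it is empty as soon as $p$ falls below $1/4$.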
16.6 Analysis of Value-Conserving Models


In dealing with simple perceptrons, a single value-conserving
model, the $\gamma$-system, has been considered. In this system, the total
value of the set of input connections to an A-unit is conserved. In four-layer
and cross-coupled perceptrons two types of value-conserving systems are of
interest: the $\gamma$-system, defined as before (where the sum of the input values
is held constant), and the $\Gamma$-system, where value is conserved over the set of
output connections from an A-unit, rather than the inputs. In the perceptrons
to be considered in the following chapters, this second system appears to offer
important advantages in performance, and will generally be preferred over the
$\gamma$-system.


The most important difference between the $\gamma$-system and the
$\Gamma$-system is that the latter tends to activate those A-units which would
respond to the most probable successor of the present stimulus, whereas the
$\gamma$-system tends to activate the set of A-units which respond to the stimulus
for which the present stimulus is the most probable predecessor. The
difference between these two situations can be seen from the following example.
Suppose there are three stimuli, A, B, and C, with transition probabilities as
shown in the following diagram:


[Diagram: transitions among stimuli A, B, and C, labeled with their probabilities]
In this case, with the $\gamma$-system, we would expect the set of A-units
responding to stimulus A to become most closely associated to the set
responding to stimulus C, since A is the only possible predecessor of C,
whereas B can be preceded by either A or C. In a $\Gamma$-system, on the other
hand, the set responding to A would be most closely coupled to the set
responding to B, and might even develop inhibitory connections to the set
responding to C, since B is the most common successor of A. Thus the
$\Gamma$-system tends to be predictive, tending to anticipate the most likely
successor of the present stimulus, whereas the $\gamma$-system tends to anticipate
the stimulus which is most likely to be preceded by the present stimulus.
As shown above, this latter choice is not necessarily a good prediction of
the next event.
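
The distinction between the two criteria can be illustrated with a small Markov chain. The numerical transition probabilities below are assumptions chosen only to satisfy the verbal description (A is the only possible predecessor of C, and B is the most common successor of A); they are not taken from the diagram, and the helper names are hypothetical.

```python
STATES = ["A", "B", "C"]
# Assumed transition probabilities P[current][next].
P = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 1.0},
    "C": {"B": 1.0},
}

def stationary(P, steps=2000):
    """Long-run stimulus frequencies, found by repeatedly applying the transition rule."""
    pi = {s: 1.0 / len(STATES) for s in STATES}
    for _ in range(steps):
        nxt = {s: 0.0 for s in STATES}
        for s, row in P.items():
            for t, prob in row.items():
                nxt[t] += pi[s] * prob
        pi = nxt
    return pi

def most_probable_successor(s):
    """The stimulus most likely to follow s (the Gamma-system criterion)."""
    return max(P[s], key=P[s].get)

def predecessor_probability(prev, cur, pi):
    """Probability that `prev` was the preceding stimulus, given that `cur` is present."""
    return pi[prev] * P[prev].get(cur, 0.0) / pi[cur]

pi = stationary(P)
successor_of_A = most_probable_successor("A")
# The stimulus for which A is the most probable predecessor (the gamma-system criterion).
gamma_target_of_A = max(STATES, key=lambda t: predecessor_probability("A", t, pi))
```

Under these assumed probabilities the successor criterion selects B, while the predecessor criterion selects C, since C is preceded by A with certainty whereas B is preceded by A only nine times in ten.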


16.6.1 Analysis of $\gamma$-systems


The differential equation for the $\gamma$-system is identical with
(16.11), except that the constants $C_{ij}$ are now equal to


nn 
Ci 2 (0:8 - ener fa; 
=f 


The negative term, $-Q_i Q_j$, is familiar from previous analyses of the
$\gamma$-system, and represents the quantity subtracted to balance the gain
in value of the active connections. It will be recalled that for a Poisson
model, $Q_{ij} - Q_i Q_j$ is always equal to or greater than zero, so that the
expected value of $C_{ij}$ will remain positive, and the previous analysis
(Section 16.2.1) applies without modification. More generally, however, and
for a binomial model in particular, the $C_{ij}$ may be negative, and the
previous analysis must be reexamined to see how this affects the situation.
To begin with, it no longer follows that the solution will be
monotone, since different combinations of positive and negative $C_{ij}$'s may
be picked up in equation (16.11), depending on which $\phi$'s are currently
non-zero. Since the solution is non-monotone, it also does not follow that a
solution will occur in $n$ steps, or that the solution of the iteration equation
(16.13) is minimal.


While we are unable, at this time, to provide any short-cut
method of finding the steady state solution (if one exists) for the $\gamma$-system,
it is possible to compute a time-dependent solution by the following procedure.
We note, first, that the solution is piecewise exponential, as in the case of the
$\alpha$-system, and that the time constants for all $\gamma^{(i)}$ are equal. This means
that we can readily determine which $\gamma^{(i)}$ will be the first to cross the level
of $\theta$, by computing the initial asymptotes, $M_0^{(i)}$, for all $i$. The $\gamma^{(i)}$
with the highest value of $|M_0^{(i)}|$ will change most rapidly. If the initial
value of $\gamma^{(i)} = 0$, and $M_0^{(i)}$ is negative, $\phi(\alpha^{(i)})$ will immediately go
to 0. If no $M_0^{(i)}$ is negative, then the first change to occur will be for some $\phi$
to change from 0 to 1, and this will occur for that $i$ for which $M_0^{(i)}$ is
greatest. Having thus obtained the first discontinuity point, $t_1$, we can
compute the values of all $\gamma^{(i)}(t_1)$, and determine the next $\phi$ to change.
This is done by computing the function

$$E_k^{(i)} = \frac{M_k^{(i)}/\delta - \gamma^{(i)}(t_k)}{(\theta - \beta^{(i)}) - \gamma^{(i)}(t_k)} \qquad (16.23)$$


Joseph has pointed out that singularities are possible. For example, with 
O=!, f=!1, 4,71 ,and 4,=-=0, if Cai) we have (at t = £n 3/2 ) 
m= %, 7-1 . Butthen 7, = 2-7, while 7, =- 72. Thus 7, 
immediately falls below 1, hence back to the original equation, which brings 
it back to 1 again. While 7, thus fluctuates about 1, the future history of 
%, is not determined. 
for all $i$. Note that $E$ will be greater than 1 only if the numerator and
denominator agree in sign, and $|M/\delta - \gamma| > |\theta - \beta - \gamma|$. If these
conditions are met (i.e., if $E_k^{(i)} > 1$), $\phi(\alpha^{(i)})$ will change value sometime
before $\gamma^{(i)}$ reaches its new asymptote. Thus, by finding the value
(or values) of $i$ for which $E_k^{(i)}$ is maximum, at the discontinuity time $t_k$,
we can always determine the next $\phi$ to change. Introducing this new $\phi$
gives us a new set of asymptotes, $M_{k+1}^{(i)}$, and the process can be
continued. The values of the $\gamma^{(i)}(t)$ at the discontinuity times can be
readily calculated from the exponential solution:

$$\gamma^{(i)}(t_{k+1}) = \frac{M_k^{(i)}}{\delta} - e^{-\delta(t_{k+1} - t_k)} \Big( \frac{M_k^{(i)}}{\delta} - \gamma^{(i)}(t_k) \Big) \qquad (16.24)$$

where the discontinuity time, $t_{k+1}$, is obtained by solving the equation for
the next $\gamma^{(i)}$ to cross threshold, that is

$$(t_{k+1} - t_k) = -\frac{1}{\delta} \log_e \frac{\beta^{(i)} - \theta + M_k^{(i)}/\delta}{M_k^{(i)}/\delta - \gamma^{(i)}(t_k)} \qquad (16.25)$$
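
One step of the piecewise procedure can be sketched numerically. The code below is only an illustration, assuming that (16.23) to (16.25) take the forms $E = (M/\delta - \gamma_k)/((\theta - \beta) - \gamma_k)$, $\gamma(t_{k+1}) = M/\delta - e^{-\delta \Delta t}(M/\delta - \gamma_k)$, and $\Delta t = -(1/\delta)\log_e[(\beta - \theta + M/\delta)/(M/\delta - \gamma_k)]$; all function names and numerical values are hypothetical.

```python
import math

def e_ratio(M, delta, theta, beta, gamma_k):
    """Ratio of distance-to-asymptote over distance-to-threshold; undefined if gamma_k == theta - beta."""
    return (M / delta - gamma_k) / ((theta - beta) - gamma_k)

def crossing_interval(M, delta, theta, beta, gamma_k):
    """Time needed for gamma to travel from gamma_k to the threshold level theta - beta."""
    return -math.log((beta - theta + M / delta) / (M / delta - gamma_k)) / delta

def gamma_after(M, delta, gamma_k, dt):
    """Value of gamma a time dt after t_k, on the exponential approach to M/delta."""
    return M / delta - math.exp(-delta * dt) * (M / delta - gamma_k)

# Illustrative values: asymptote M/delta = 2, threshold level theta - beta = 1, start at 0.
M, delta, theta, beta, gamma_k = 2.0, 1.0, 1.0, 0.0, 0.0
E = e_ratio(M, delta, theta, beta, gamma_k)
dt = crossing_interval(M, delta, theta, beta, gamma_k)
gamma_at_crossing = gamma_after(M, delta, gamma_k, dt)
```

Here $E = 2$, which exceeds 1, so a crossing occurs; the interval is $\log_e 2$, and substituting it back into the exponential solution returns exactly the threshold level, which is a useful consistency check on the three relations.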


16.6.2 Analysis of $\Gamma$-systems


The $\Gamma$-system is similar to the $\gamma$-system, except that
after each increment of reinforcement, the total value is restored to its
former level by subtracting the net gain uniformly from the set of output
connections from an A-unit, instead of the input connections. The
differential equation now takes the form


(t) 


ae ») () ‘ 16.26 
dt = Na”? pas 2. | (oe) - 94] Q;) iysor ( ) 
7 = 


The same uncertainties as to existence of steady state solutions
and difficulties of computation occur here as in the case of the $\gamma$-system
analysis. A time-dependent solution can again be computed, piecewise, by
the same procedure as above. In Chapter 19, we shall reconsider the
$\Gamma$-system, in connection with cross-coupled perceptrons.


16.7 Functionally Equivalent Models


In Ref. 41, Joseph has presented an analysis of a perceptron with
"binodal A-units", which is now seen to be functionally equivalent to a variation
of the system analyzed above. In the binodal model, there is only a single
layer of A-units, but each A-unit receives two logically distinct sets of input
connections and has a separate threshold for each set. The first set of
connections is fixed in value, and activates the A-unit according to the usual
rules. The second set consists of a single connection from every sensory
point in the retina, and is variable in value. The reinforcement rule for
these variable connections is that if the A-unit is active at time $t$, and the
retinal origin point of one of the variable connections is active at $t+1$, the
variable connection gains an increment in value. At the same time, all
variable connections tend to decay at a fixed rate, $\delta$. This is equivalent
to a four-layer model in which each $A^{(2)}$ unit receives its fixed connection
from an $A^{(1)}$ unit with a normal number of input connections and threshold $\theta$,
and receives variable connections from $N_s$ other $A^{(1)}$ units, each having a
single excitatory input connection, and a threshold of 1. The main difference
from the above analysis would then be that the $A^{(2)}$ unit responds to the
logical sum, rather than the algebraic sum, of the inputs from the fixed
connections and the variable connections, i.e., the $A^{(2)}$ unit is active if its
fixed connection (the $\beta$-component) is active, or if the sum of the variable
connections (the $\gamma$-component) $> \theta$. As this writer had previously
predicted on heuristic grounds, Joseph has successfully demonstrated that
similarity generalization will tend to occur in the binodal model, after a
preconditioning sequence analogous to those discussed above. In this system,
the set of fixed connections acts as a "template", and the variable connections
tend to adapt themselves to an origin configuration which resembles the fixed
set under the transformation $T$. The reader is referred to Reference 41 for a
quantitative analysis.


While it was assumed that the models analyzed in the preceding
sections had a complete set of connections (from every $A^{(1)}$ unit to every
$A^{(2)}$ unit), a system which merely has a large number of input connections
to each $A^{(2)}$ unit, originating from randomly selected $A^{(1)}$ units, can be
seen to be equivalent in all of its essential properties. For such a system,
the $Q^{(1)}$ matrix, representing the expected values of the fractions of $A^{(1)}$
units responding to $S_i$ and $S_j$, would have the same equations as before,
except that $N_a$ must be replaced by the number of variable connections to
each $A^{(2)}$ unit.

In the following chapter, it will be shown that a form of weakly
cross-coupled system, in which there are no closed loops, is also virtually
equivalent to the model analyzed in this chapter, and can be represented by
the same equations, with a slight reinterpretation of the $\beta$-component of
the input signals to the A-units.
17. OPEN-LOOP CROSS-COUPLED SYSTEMS


The most interesting features of cross-coupled perceptrons are 
those which result from the possibility of closed feed-back loops, or 
cycles, in the network. It is possible, however, to design a cross-coupled 
system with no closed loops, and such a system has a number of important 
features, including the ability to act as an adaptive similarity-generalizing 
system equivalent to the perceptrons of Chapter 16, and increased economy 
and versatility in general classification problems of the sort considered in 
Chapter 5. These properties will be considered briefly, in this chapter, 
before proceeding to closed-loop systems, which represent a more challenging 


problem in analysis. 


17.1 Similarity-Generalizing Systems: An Analog of the Four-Layer System 


The three-layer perceptron shown in Fig. 45 is directly comparable
to the four-layer system considered in the last chapter. The A-units are
divided into two subsets, called A' and A". All A-units receive fixed
connections from the retina, but only the A" units have connections to the
R-units, the A' units sending their output signals to the A" units. Each A' unit
is connected (in a fully-coupled model) to all A" units, and each A" unit is
connected to all A' units. The rule for modifying the connections from A'
to A" units is identical with the rule for modifying $A^{(1)}$ to $A^{(2)}$ connections,
in the four-layer system considered previously: If the origin of the connection
is active at time $t$, and the terminus is active at $t+1$, the connection gains a
quantity $\eta$. All inter-A-unit connections decay at a rate $\delta$, as before.
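
The modification rule for the A' to A" connections can be written as a single discrete-time update. This is a minimal sketch, not the continuous-time formulation of (16.11): it assumes, purely for illustration, that decay removes a fraction $\delta$ of each connection's current value per time step, and the function name `update_connections` and the sample values are hypothetical.

```python
def update_connections(values, origin_active_t, terminus_active_t1, eta, delta):
    """One time step for values[i][j], the connection from A' unit i to A'' unit j.

    A connection gains eta if its origin was active at time t and its terminus
    is active at t+1; every connection also decays in proportion to its value.
    """
    updated = []
    for i, row in enumerate(values):
        new_row = []
        for j, v in enumerate(row):
            gain = eta if (origin_active_t[i] and terminus_active_t1[j]) else 0.0
            new_row.append(v - delta * v + gain)
        updated.append(new_row)
    return updated

values = [[1.0, 0.0],
          [0.5, 0.5]]
new_values = update_connections(
    values,
    origin_active_t=[True, False],     # only the first A' unit fired at time t
    terminus_active_t1=[False, True],  # only the second A'' unit fires at t+1
    eta=0.1,
    delta=0.2,
)
```

Only the connection from the first A' unit to the second A" unit receives the increment; all other connections simply decay.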
Figure 45 OPEN-LOOP CROSS-COUPLED SYSTEM (COMPARE Figure 42). BROKEN LINES
INDICATE VARIABLE CONNECTIONS.


Clearly, the only difference between this model and the
previous one is that the $\beta$-component, instead of originating from one
of the $A^{(1)}$ units, comes direct from the retina, and consequently can take
on more than two values. The differential equation (16.11) and the
equilibrium equation (16.12) thus apply without modification to this system
(where the A' set is equated with the $A^{(1)}$ set, and the A" set with the
$A^{(2)}$ set). The additional freedom in choice of $\beta$-values means that the
sets designated $A^{(2)}(S_i)$, representing sets of units whose $\beta$-value
in response to $S_i$ is $+1$, must now be fractionated into subsets for
each possible value of $\beta$, and the history of each such subset (having a
given $\beta$-vector) must be followed separately. Thus the full designation
of such a subset would be $A(\beta_i, S_i)$. Apart from this further
fractionation of the A-set, the same analysis holds as in the last chapter,
and much the same results would be expected.


17.2 Comparison of Four-Layer and Open-Loop Cross-Coupled Models


A numerical comparison of the performance of the perceptrons 
considered in this and the preceding chapter will be based on the following 
experiment: 


EXPERIMENT 12: Take an environment of four stimuli, $S_1, \ldots, S_4$, each
        having a retinal area of .2. The intersections $C_{13}$ and $C_{24}$
        are each equal to .1, and all other intersections are zero. The
        perceptron is exposed to the following sequence, which is
        repeated until a steady state is attained:

        ($S_1 S_2 S_1 S_2 S_1 S_2 S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4 S_3 S_4 S_3 S_4 S_3 S_4$). This sequence
        can be considered to consist of two events, the first consisting of
        the alternating pair $S_1 S_2 S_1 S_2 \ldots$ with a duration of $10\tau$,
        and the second consisting of $S_3 S_4 S_3 S_4 \ldots$, also with duration
        of $10\tau$. A matrix of $Q_{ij}$ functions is obtained at the
        beginning and end of the preconditioning procedure, to compare
        steady state with initial conditions.
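
The transition statistics implied by this repeated sequence can be tabulated directly. The sketch below assumes that the twenty-stimulus cycle is repeated end to end, so that the last $S_4$ is followed by the first $S_1$ of the next cycle; the function names are hypothetical.

```python
from collections import Counter

def experiment12_cycle(repeats=5):
    """One cycle: S1 S2 alternating for 10 steps, then S3 S4 alternating for 10 steps."""
    return [1, 2] * repeats + [3, 4] * repeats

def transition_frequencies(cycle):
    """Relative frequency with which stimulus k follows stimulus j in the repeated cycle."""
    successors = cycle[1:] + cycle[:1]   # wrap around to the start of the next cycle
    pair_counts = Counter(zip(cycle, successors))
    totals = Counter(cycle)
    return {(j, k): count / totals[j] for (j, k), count in pair_counts.items()}

cycle = experiment12_cycle()
freq = transition_frequencies(cycle)
```

Within an event the partner stimulus follows with high relative frequency (1.0 for $S_1 \to S_2$, 0.8 for $S_2 \to S_1$), whereas a transition between events, such as $S_2 \to S_3$, occurs only once in five presentations of $S_2$.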
The relationship among the four stimuli can be seen from the
following Venn-diagram of the retinal sets, where the double-headed arrows
indicate the oscillating pairs of stimuli, and the number in each cell
indicates its area.


[Venn diagram of the retinal sets of $S_1, \ldots, S_4$]

The initial and terminal Q-matrices have been computed for a four-layer
and open-loop cross-coupled perceptron, as a function of the parameter
$N_a \eta/\delta$. In both models the parameters of the $A^{(1)}$ units (or of all
A-units, in the cross-coupled case) were $x = 3$, $y = 0$, and $\theta = 2$,
with a binomial model. In the four-layer model, $\theta^{(2)}$ was also taken to be
2, so that the systems are directly comparable.
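
The initial Q-values for these parameters can be estimated by direct enumeration. The sketch below rests on several assumptions that should be read as such: each A-unit has $x = 3$ excitatory connections whose origins fall independently and uniformly on the retina, a unit responds to a stimulus when at least $\theta = 2$ of its origins lie inside that stimulus, and $S_1$ and $S_3$ (each of area .2) share a region of area .1. The region names and the function `response_probability` are hypothetical.

```python
from itertools import product

# Areas of the disjoint retinal regions relevant to S1 and S3 (fractions of the retina).
REGIONS = {"S1_only": 0.1, "both": 0.1, "S3_only": 0.1, "outside": 0.7}
IN_S1 = {"S1_only", "both"}
IN_S3 = {"both", "S3_only"}

def response_probability(required_stimuli, x=3, theta=2):
    """Probability that a unit with x random origins reaches theta for every listed stimulus."""
    total = 0.0
    for origins in product(REGIONS, repeat=x):   # every assignment of origins to regions
        prob = 1.0
        for region in origins:
            prob *= REGIONS[region]
        if all(sum(r in stim for r in origins) >= theta for stim in required_stimuli):
            total += prob
    return total

Q11 = response_probability([IN_S1])          # fraction of units responding to S1
Q13 = response_probability([IN_S1, IN_S3])   # fraction responding to both S1 and S3
```

Under these assumptions the fraction of units responding to a single stimulus comes to .104, and the fraction responding to both of two stimuli that intersect in an area of .1 comes to .034.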


The Q-matrices obtained in this experiment are shown in
Tables 5 and 6. The important Q-functions are also shown graphically in
Fig. 46, as a function of the parameter $N_a \eta/\delta$. Note that for both
models, there is a considerable parametric range within which generalization
is much greater for stimuli which belong to the same event than for stimuli
from different events. This gain in generalization between $S_1$ and $S_2$,
and between $S_3$ and $S_4$, is more than sufficient to offset the handicap of
the intersections between $S_1$ and $S_3$, and between $S_2$ and $S_4$, which
gives the system an initial disadvantage. The cross-coupled model, while
it follows a similar history, has a considerably greater "useful range"
than the four-layer model. For the four-layer system, the range of
TABLE 5
Q-MATRICES FOR FOUR-LAYER $\alpha$-PERCEPTRON IN EXPERIMENT 12


(PARAMETERS: $x = 3$, $y = 0$, $\theta = 2$)


INITIAL Q-MATRIX: 


S232 
8 
2 
FE 


TERMINAL MATRICES FOR: 


77.0 < $N_a \eta/\delta$ < 88.9


88.9 < $N_a \eta/\delta$ < 166.6


a 


104 .070 .034% 000 
070 .17% .000 .034 
034.000 .104% 070 
000.034 «4.070178 
& 140.034 4 
TABLE 6
Q-MATRICES FOR OPEN-LOOP CROSS-COUPLED $\alpha$-PERCEPTRON IN EXPERIMENT 12


(PARAMETERS: $x = 3$, $y = 0$, $\theta = 2$)


INITIAL Q-MATRIX:


.104 .000 .034 .000
.000 .104 .000 .034
.034 .000 .104 .000
.000 .034 .000 .104


TERMINAL MATRICES FOR: -122 .018 .03% .000 


018 .10% .000 .034 
38.5 < $N_a \eta/\delta$ < 44.5


Me 2 
3 
co] 
2 


44.5 < $N_a \eta/\delta$ < 77.0


77.0 < $N_a \eta/\delta$ < 83.3


83.3 < $N_a \eta/\delta$ < 88.9


88.9 < $N_a \eta/\delta$ < 117.6


-176 .210 .072 .034 


117.6 < $N_a \eta/\delta$ < 166.6


166.6 < $N_a \eta/\delta$ < 235.2


$N_a \eta/\delta$ > 235.2


223 
F 5 $3 a ae — 
Sea a Na Ns Na Ns ea 
Figure 46 COMPARISON OF 4-LAYER AND OPEN-LOOP CROSS-COUPLED $\alpha$-PERCEPTRONS
ON EXPT. 12 ($x = 3$, $y = 0$, $\theta = 2$ FOR BOTH SYSTEMS).
PANELS: (a) 4-LAYER MODEL; (b) OPEN-LOOP CROSS-COUPLED MODEL. HORIZONTAL AXIS: $N_a \eta/\delta$.
$N_a \eta/\delta$ for which the system tends to classify events "correctly" is
77.0 to 166.6, while for the cross-coupled model this range is extended to
38.5 to 235.2. Thus the cross-coupled model begins to show the
generalization effects earlier, and saturates later than the four-layer system.
Moreover, the transition occurs more gradually, in eight steps for the
cross-coupled system as opposed to three for the four-layer model.


The matrices shown here assume $\alpha$-system reinforcement.
A $\gamma$- or $\Gamma$-system, with the four-layer model, eliminates
all $A^{(2)}$ activity immediately, in this experiment. In the cross-coupled
model, however, activity is not completely eliminated, and the terminal
Q-matrices obtained for a $\gamma$-perceptron are shown in Table 7. Note
that the bias favoring $Q_{13}$ and $Q_{24}$ is eliminated for most values of
$N_a \eta/\delta$, and that the "dynamic range" is greater than in the $\alpha$-system.
The $\Gamma$-system, illustrated in Table 8, is similar to the $\gamma$-perceptron
for small values of $N_a \eta/\delta$, but it appears to "saturate" more easily.


While the performance of the cross-coupled perceptron closely
resembles the system in Chapter 16, it is a somewhat more satisfying
model from the standpoint of biological plausibility and parsimony, since
it does not require the assumption of a special set of fixed connections
from $A^{(1)}$ to $A^{(2)}$ units in addition to the variable connections, an
assumption which was necessary, in the four-layer system, to provide a
"template" for the organization of similar $A^{(1)}$ units to be connected to
each $A^{(2)}$ unit, and in order to prevent all connections from decaying to
zero value. In the present scheme, all S-A connections are fixed, and
all other connections variable, yielding a conceptually simpler organization.
TABLE 7


Q-MATRICES FOR OPEN-LOOP CROSS-COUPLED $\gamma$-PERCEPTRON
IN EXPERIMENT 12


(Parameters: $x = 3$, $y = 0$, $\theta = 2$)


INITIAL Q-MATRIX:


.104 .000 .034 .000
.000 .104 .000 .034
.034 .000 .104 .000
.000 .034 .000 .104


TERMINAL MATRICES FOR: -008 .000 .001 .000 
0o< o” < 68.7 
a a 


68.7 < “et < 85.8 


N 
85.8 < “82 «io .008 .008 .019 .012 


& .008 .012 .012 
(8 .022 .014 .01% 


-O16 =.01% =.022 6.022 
O16 = .014 =.022 =—.022 


-025 .025 .017 .017 
025 .025 .017 .017 
O17) .017) =~.025 = =.025 
017, .017 .025 .025 


-030 .030 .030 .030 
-030 .030 .030 .030 




-030 .030 .030 .030 
Na? 300. .030 .030 .030 .030 
d
TABLE 8


Q-MATRICES FOR OPEN-LOOP CROSS-COUPLED $\gamma$-PERCEPTRON
IN EXPERIMENT 12


(Parameters: $x = 3$, $y = 0$, $\theta = 2$)


108 
000 
INITIAL Q-MATRIX: 034
-000 
TERMINAL MATRICES FOR: -008 
-000 
0< “av € 58.5 001 
-000 
-009 
N -002 
58.5 < “oF < 77.8 .002 
-002 
.019 
aN 012 
77.8 < < 88.5 .008 
.008 
022 
015 
a”? 
88.5 < < 92.0 O14 
O14 
025 
Na” 025 
92.0 < o <€ 131 .020 
.020 
028 
a”? .028 
i3l< < 18! 026 
- 026 
.030 
a” .030 
> 181 030 
.-030
-000
104 


-03% 
-000 
- 10% 
-000 


00) 
-000 
-006 
-000 


- 002 
-002 
-009 
-002 


-008 
-008 
-019 
012 


-O18 
018 
-022 
015 


-020 
-920 
-025 
025 


026 
-026 
-028 
-028 


-030 
-030 
-030 
-030 


It will be seen in Chapter 19 that this system, with the addition
of a unit time-delay (all $\tau_{ij} = 1$), performs identically to a closed-loop
fully cross-coupled perceptron for the first two cycles of operation. By
further extension of the network along the same lines, it will be shown that
additional cycles of closed-loop activity can be duplicated.

17.3 Reduction of Size Requirements for Universal Perceptrons 


In the case of simple perceptrons, it was demonstrated that in
order to obtain a "universal perceptron", in which a solution exists for any
classification of $n$ stimuli, at least $n$ A-units are required (Theorem 3,
Corollary 2, Chapter 5). Now consider an open-loop cross-coupled perceptron,
constructed as follows: Let the A-units be numbered in series
$a_1, a_2, \ldots, a_{N_a}$, and let $N_a = N_s$ (the number of S-points). The
last of these units, $a_{N_a}$, has an output connection to an R-unit. Each
A-unit has a variable-valued connection from every S-point, plus one
connection for every A-unit prior to itself in the series; i.e., $a_r$
receives a connection from every S-point and from $a_1, a_2, \ldots, a_{r-1}$.
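
As an illustration of the series construction just described, the following
sketch brute-forces the smallest case, $N_s = N_a = 2$, and counts how many of
the $2^4 = 16$ dichotomies of the four possible stimuli the last unit can
realize. The integer weight range, the firing rule (a unit fires when its
summed input is at least its threshold), the inclusion of the null stimulus
among the four, and the function names are all assumptions made for the
example; it is not the explicit construction referred to in the text.

```python
from itertools import product


def threshold_patterns(inputs, weights):
    """All 0/1 response patterns one threshold unit can produce on `inputs`.

    Every combination of input weights and threshold drawn from `weights`
    is tried; the unit fires when the weighted sum is >= its threshold.
    """
    n_in = len(inputs[0])
    patterns = set()
    for *w, theta in product(weights, repeat=n_in + 1):
        patterns.add(tuple(
            int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)
            for x in inputs
        ))
    return patterns


def cascade_dichotomies(n_s=2, weights=range(-2, 3)):
    """Dichotomies realizable by a_2 in a two-unit series.

    a_1 receives the S-points only; a_2 receives the S-points and a_1.
    """
    stimuli = list(product((0, 1), repeat=n_s))
    realized = set()
    for a1 in threshold_patterns(stimuli, weights):
        # Append a_1's output as one extra input line to a_2.
        extended = [s + (a,) for s, a in zip(stimuli, a1)]
        realized |= threshold_patterns(extended, weights)
    return realized


single = threshold_patterns(list(product((0, 1), repeat=2)), range(-2, 3))
cascade = cascade_dichotomies()
print(len(single), len(cascade))  # 14 16
```

A single unit misses the two parity dichotomies, while the two-unit series
reaches all 16, so in this toy case $\log_2(n) = 2$ A-units suffice.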


It has been demonstrated by Cameron* that for small values of $n$
($n = 2^{N_s}$) only $\log_2(n)$ A-units are required in order to obtain a
universal perceptron, in which a solution exists for all of the $2^n$ possible
classifications. This was demonstrated by explicit construction for $n$ as
large as 8. At some higher value of $n$, this ceases to be true, although
the maximum $n$ for which the observation holds true has not yet been
determined.

* S. Cameron, personal communication.

A lower bound for the number of A-units required for a
universal perceptron in such a system has been obtained by Joseph (although
it is not a least upper bound). The analysis (given in the Appendix of
Ref. 41) is based on the Hay-Joseph theorem that the maximum number of
orthants achievable by linear combinations of $r$ vectors in $n$-space is
approximately M(n,r) = ear where - is large, and 7” is small 
relative to $n$. An upper bound for the number of dichotomies achievable
with $N_a$ A-units is found to be
$M(2^{N_s}, N_s+1)\,M(2^{N_s}, N_s+2) \cdots M(2^{N_s}, N_s+N_a)$.
It is shown that for large $N_s$ the number of possible dichotomies is
increasing at a much greater rate than the number of achievable dichotomies,
so that there must be some point at which the system ceases to act as a
universal perceptron.

18. Q-FUNCTIONS FOR CROSS-COUPLED PERCEPTRONS


A general cross-coupled perceptron is illustrated in Figure 47. 
It consists of three layers of units, with complete freedom of interconnection 
among the A-units. Due to the likelihood of closed circuits of connections 


within the network, this is called a closed-loop system. 


Figure 47. TYPICAL CONNECTIONS IN A CLOSED-LOOP CROSS-COUPLED PERCEPTRON
(layer labels in the figure: S-UNITS, A-UNITS)


In passing from open-loop to closed-loop networks, several
fundamentally new considerations enter into the analysis. In the first
place, the state of the network at time $t$ becomes a function, not only
of the present sensory input and the momentary values of the connections,
but of the preceding sequence of inputs and past activity states as well.
The dependence of the system's state upon time-sequences of previous
states means that the transmission time, $\tau_{ij}$, which previously
played no part or only a minor part in the analysis of system performance,
now becomes a parameter to be reckoned with at all times. The question
of network stability is also a fundamental one; some cross-coupled
networks, once triggered, will explode into total activity which prevents any
further stimuli from making any impression at all, others will oscillate, and
others will settle down to a stable steady-state condition. In this chapter,
we begin by re-examining the concept of Q-functions, in order to provide a
means of measuring the response of the network to sequences of stimuli,
and comparing its response quantitatively for different stimulus sequences.
These new Q-functions will be found to encompass the functions analyzed
in Chapter 6 as a special case.


18.1 Stimulus Sequences: Notation 


In Chapter 4, a stimulus was defined as any set of input signals
to sensory units of a perceptron, excluding the null stimulus. In practice,
these signals are generally taken to be 1 or zero. For present purposes,
the null stimulus (all signals equal to zero) will be re-admitted as a stimulus,
and will be symbolized by 0 when it occurs as part of a sequence. A
stimulus sequence, $\mathcal{S}_i = (S_{i1}, S_{i2}, \ldots, S_{im})$, can be an
arbitrary series of stimuli which are assumed to occur at successive discrete
times $t_0, t_0+\Delta t, t_0+2\Delta t, \ldots, t_0+(m-1)\Delta t$. An arbitrary
set of stimulus sequences can be taken to comprise a stimulus-sequence world,
for a given perceptron, in the sense of Definition 26 of Chapter 4.

In this and the subsequent chapters, it will be assumed that the
transmission time, $\tau_{ij}$, is equal to $\Delta t$ for all connections,
$c_{ij}$, and this transmission time will be symbolized in abbreviated form
by $\tau$. Consequently, if a stimulus $S_i$ occurs at time $t$, the response
to this stimulus in the A-system occurs at time $t+\tau$, and $Q_i$ is
interpreted to mean the probability that an A-unit is activated at time $t$
if $S_i$ occurs at time $t-\tau$. In a cross-coupled perceptron, however,
$Q_i$ is not a well-defined quantity, since in addition to signals from the
retina, an A-unit may receive signals from other A-units at time $t$, so that
the response at time $t$ depends both on $S(t-\tau)$ and on the activity state
of the association system at $t-\tau$. $Q_i$ is therefore redefined to apply to
sequences $\mathcal{S}_i$ of length $m$, which begin at time $t-m\tau$, and
terminate at $t-\tau$, with the association system assumed to be totally
inactive, or "silent", at time $t-m\tau$. In this case, for a sequence of
length 1, $Q_i$ is interpreted in the usual manner, and is represented by the
equations of Chapter 6, without modification. For a general sequence of
length $m$, we use the notation $Q_{i_m}$ to designate the probability that an
A-unit is active at time $t$, given that the sequence $\mathcal{S}_i$ began at
time $t-m\tau$, so that the $m$th member of the sequence occurred at $t-\tau$.
More generally, we can write $Q_{i_\mu}$ to designate the probability that an
A-unit is active at time $t$ if the sequence $\mathcal{S}_i$ began at
$t-\mu\tau$, where $\mu$ may be less than, equal to, or greater than $m$. If
$\mu$ is less than $m$, this is equivalent to the probability of response to a
truncated sequence, containing only the first $\mu$ stimuli of the sequence
$\mathcal{S}_i = (S_{i1}, S_{i2}, \ldots, S_{i\mu}, \ldots, S_{im})$. If
$\mu > m$, we adopt the convention that the sequence $\mathcal{S}_i$ is
understood to have been augmented by the addition of $\mu-m$ null stimuli,
yielding the sequence $(S_{i1}, S_{i2}, \ldots, S_{im}, 0, \ldots, 0)$. In
other words, it is assumed that the sequence $\mathcal{S}_i$ began at
$t-\mu\tau$, and that no other inputs occurred through time $t-\tau$, the
probability of A-unit activity then being determined for time $t$. In a
simple perceptron, this probability would, of course, be zero for $\mu > m$;
in a cross-coupled system, however, the presence of persistent cycles of
activity, or reverberating loops in the A-system, may maintain
$Q_{i_\mu} > 0$ for an indefinite period.
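
The truncation and padding convention for $Q_{i_\mu}$ can be made concrete
with a small helper. This is only a sketch: representing a stimulus as a
frozenset of active S-point indices, the name `effective_sequence`, and the
constant `NULL` are assumptions introduced for the example.

```python
NULL = frozenset()  # the null stimulus: no S-point active


def effective_sequence(seq, mu):
    """Sequence actually presented when seq began mu time steps ago.

    mu <= len(seq): only the first mu stimuli have occurred (truncation).
    mu >  len(seq): the sequence is followed by mu - len(seq) null stimuli.
    """
    m = len(seq)
    if mu <= m:
        return list(seq[:mu])
    return list(seq) + [NULL] * (mu - m)


seq = [frozenset({0, 1}), frozenset({1, 2}), frozenset({3})]
truncated = effective_sequence(seq, 2)
padded = effective_sequence(seq, 5)
print(len(truncated), len(padded))
```

In both cases the returned list has exactly `mu` entries, so the last entry
is always the stimulus occurring at $t-\tau$.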

$Q_{ij}$ is redefined in a manner analogous to $Q_i$. Where $\mathcal{S}_i$
and $\mathcal{S}_j$ are any two sequences, we define

    $Q_{i_\mu j_\nu}$ = probability that an A-unit responds at time $t$ if
    $\mathcal{S}_i$ begins at $t-\mu\tau$, and also responds at time $t$
    if $\mathcal{S}_j$ begins at $t-\nu\tau$.

It is again assumed that the A-system is "silent" at the start of each sequence
for which the Q-function is defined, and that if $\mu$ or $\nu$ is greater than
$m$ the corresponding sequence is augmented by a sufficient number of null
stimuli at the right-hand end. Q-functions with arbitrary numbers of
subscripts can be generated by an obvious extension of the above definition.

In contexts where no ambiguity can arise, the notation $Q_{ij}$ will
be used to denote $Q_{i_m j_{m'}}$, i.e., the probability that an A-unit
responds immediately after the termination of $\mathcal{S}_i$ and also
responds immediately after the termination of $\mathcal{S}_j$. Note that it is
not required that the sequences $\mathcal{S}_i$ and $\mathcal{S}_j$ be
commensurate, i.e., the lengths $m$ and $m'$ may be different for the two
sequences, without requiring any redefinition of $Q_{ij}$.

Generalization coefficients, $g_{i_\mu j_\nu}$, can be defined analogously
to Q-functions. For example, in an alpha system, we would have
$E(g_{i_\mu j_\nu}) = Q_{i_\mu j_\nu}$, where $g_{i_\mu j_\nu}$ is a measure of
the increment added to the output signal of the A-set responding after $\mu$
stimuli of the sequence $\mathcal{S}_i$, as a result of an
$\alpha$-reinforcement after the $\nu$th stimulus of the sequence
$\mathcal{S}_j$. Again, if the second-order subscripts are suppressed, it will
be assumed that $g_{ij} = g_{i_m j_{m'}}$, the effect of a reinforcement
immediately after the termination of $\mathcal{S}_j$ upon the signal which
follows immediately after the termination of $\mathcal{S}_i$. If
reinforcements are always applied and measured immediately after the end of
stimulus sequences, the performance of the perceptron in learning responses to
such sequences can be derived from the resulting G matrix, in precisely the
same manner as was done for elementary perceptrons in Part II. Thus a
knowledge of the Q-functions for a cross-coupled perceptron permits us to
predict the performance of such systems in discrimination and generalization
experiments.


18.2 $Q_i$ Functions and Stability

The rigorous analysis of $Q_{i_\mu}$ for a cross-coupled perceptron
with a finite number of A-units presents the identical difficulty which was
encountered in the case of Q-functions for multi-layer systems (Section 15.1).
The probability $Q_{i_1}$ is, of course, identical to the function $Q_i$
defined for the first stimulus of the sequence $\mathcal{S}_i$ in accordance
with the equations of Chapter 6; but the probability $Q_{i_2}$ already depends
upon the distribution of numbers of A-units which respond to the first
stimulus, $S_{i1}$. In order to avoid consideration of these distributions, the
Q-functions obtained here will always represent limits for large networks,
where it can be assumed that the actual proportion of A-units responding after
$S_{i\mu}$ is equal to $Q_{i_\mu}$. It should be noted that due to the
assumption that the sequence $\mathcal{S}_i$ starts with a "silent"
perceptron, $Q_{i_0} = 0$.

A number of alternative topological models might be considered.
For convenience, the following analysis takes up the case of a perceptron in
which both the connections from the retina to the A-units and the "internal"
connections to each A-unit are constrained as in the binomial model of
Chapter 6. In this model, we have five parameters for each A-unit:

    $\theta$ = threshold of A-unit
    $x_s$ = number of excitatory connections from the S-set, or retina
    $y_s$ = number of inhibitory connections from the retina
    $x_a$ = number of excitatory connections from other A-units
    $y_a$ = number of inhibitory connections from other A-units

In the present chapter, we shall be concerned only with perceptrons in which
all input connections to A-units are fixed in value, regardless of where they
originate. Systems with modifiable couplings between A-units will be
considered in the following chapter. It is assumed that each of the above
sets of connections has its origin points assigned at random from a uniform
probability distribution over the S-set or the A-set, as required. This
results in the following equation for $Q_{i_\mu}$:

$$Q_{i_\mu} = \sum_{E_s - I_s + E_a - I_a \,\ge\, \theta} P_1(E_s)\,P_2(I_s)\,P_3(E_a)\,P_4(I_a) \tag{18.1}$$

where $R_{i\mu}$ = fraction of S-units activated by $S_{i\mu}$.

Taking $Q_{i_0} = 0$, $Q_{i_\mu}$ can thus be developed recursively in terms
of $Q_{i_{\mu-1}}$ up to any value of $\mu$.
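
The recursion of Equation (18.1) can be sketched numerically as follows. The
explicit forms of the four probability functions are not reproduced here, so
the sketch assumes the natural binomial reading: each of the $x_s$ (or $y_s$)
retinal connections is active with probability equal to the stimulus area,
and each of the $x_a$ (or $y_a$) internal connections is active with
probability equal to the previous activity level. The function names are
hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def q_step(r, q_prev, theta, x_s, y_s, x_a, y_a):
    """One step of the recursion: P(E_s - I_s + E_a - I_a >= theta).

    r      : fraction of S-units active in the current stimulus (assumed
             to be the success probability for retinal connections)
    q_prev : fraction of A-units active one transmission time earlier
    """
    total = 0.0
    for e_s in range(x_s + 1):
        for i_s in range(y_s + 1):
            for e_a in range(x_a + 1):
                for i_a in range(y_a + 1):
                    if e_s - i_s + e_a - i_a >= theta:
                        total += (binom_pmf(e_s, x_s, r)
                                  * binom_pmf(i_s, y_s, r)
                                  * binom_pmf(e_a, x_a, q_prev)
                                  * binom_pmf(i_a, y_a, q_prev))
    return total


def q_sequence(areas, theta, x_s, y_s, x_a, y_a):
    """Activity levels after each stimulus, starting from a silent A-system."""
    q, out = 0.0, []
    for r in areas:
        q = q_step(r, q, theta, x_s, y_s, x_a, y_a)
        out.append(q)
    return out


# A transient stimulus of area 0.5 followed by three null stimuli.
levels = q_sequence([0.5, 0.0, 0.0, 0.0], theta=2, x_s=3, y_s=0, x_a=4, y_a=2)
print(levels)
```

Because the A-system starts silent, the first value depends only on the
retinal parameters; here it is the probability that at least 2 of 3
excitatory connections are active at area 0.5, which is 0.5. The internal
connection counts in the call are illustrative values, not taken from the
figures.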


For a Poisson model, in which the number of output connections
from each A-unit is constrained but the number of inputs is a random
variable (or in which both ends of a set of connections are picked at random)
equation (18.1) still applies, but the probability functions $P_1$, $P_2$,
$P_3$, and $P_4$ must be redefined, in a manner analogous to Chapter 6. It is
also possible, of course, to have some kinds of connections (e.g., the
internal excitatory connections) distributed binomially, while the other sets
of connections are organized according to a Poisson model, so that
$P_1, \ldots, P_4$ need not all be of the same type. For present purposes,
however, we shall continue to concentrate on the pure binomial model defined
above. All major conclusions undoubtedly apply to Poisson and mixed systems
equally well.

One of the first questions to be raised about such a system concerns
the stability of the activity-level, and the possible tendency of the system to
burst into total activity in response to a transient stimulus (which would, of
course, preclude any possibility of learning or discrimination of different
stimuli). Figure 48 illustrates the response to a transient stimulus (i.e., a
sequence of length 1) for a number of representative cases. Figure 49 presents
the response of a number of networks to a steadily maintained stimulus, or a
sequence of stimuli all of which have the identical area. (Note that it follows
from Equation (18.1) that the actual sequence of stimuli does not affect
$Q_{i_\mu}$, so long as the stimulus area, $R_{i\mu}$, is fixed for each
$S_{i\mu}$. Thus any two sequences for which the succession of $R_{i\mu}$ are
equivalent will yield the same value of $Q_{i_\mu}$.)


Figure 48(a) illustrates the effect of the size of the "trigger sti- 
mulus" upon the transient response of the system. Note that the final activity 
level is independent of &- ; it is also independent of x, and y, , 80 long as 

¥, = 0 .. Figure 48(b) shows the effect of varying the ratio of internal 
excitation to internal inhibition ($x_a$ and $y_a$). For a purely excitatory
system, total activity of the network is likely to occur, in which all A-units
become and remain active. As the inhibitory component is increased, a lower
level of stable activity results, and with still further increase in $y_a$
relative to $x_a$, the initial transient activity will die away entirely.
Figure 48(c) shows that the effect of increasing the threshold of the A-units
is similar to the effect of increasing the internal inhibitory component. It
should be noted that all of these Q-functions in response to transient
phenomena in a cross-coupled system are identical to the succession of
Q-functions for successive layers of a multilayer perceptron (as discussed in
Chapter 15). For infinite $N_a$, the equations for $Q_i^{(\mu)}$ and
$Q_{i_\mu}$ are identical, where $\mu$ in the first case denotes the layer, and
in the second the cycle of activity in the A-system.

Figure 49(a) shows that as the internal inhibitory component is
increased to the point where the terminal steady-state level of the system is
below the value of $Q_{i_1}$ for the initial impulse from the retina, a damped
series of oscillations occurs, which becomes pronounced as $y_a$ is increased.
Changing the threshold (as in Fig. 49(b)) also serves to reduce the asymptotic
activity level, but does not cause the qualitative alteration from a monotonic
to an oscillating sequence, as does the increase in $y_a$. A sequence which is
either monotone or oscillating for one value of $\theta$ will remain monotone
or oscillating as $\theta$ is changed.
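
The transient responses discussed in connection with Figure 48 involve only
the internal connections once the trigger stimulus has passed, so they can
be explored by iterating a one-dimensional map on the activity level. The
sketch below assumes the same binomial reading of the internal connections
as in the large-network limit of Equation (18.1); the function names and the
parameter values in the examples are hypothetical and are not those used for
the figures.

```python
from math import comb


def internal_map(q, theta, x_a, y_a):
    """Next activity level when only internal signals are present.

    Returns P(E_a - I_a >= theta), with E_a ~ Binomial(x_a, q) and
    I_a ~ Binomial(y_a, q) taken as independent (an assumption).
    """
    total = 0.0
    for e in range(x_a + 1):
        p_e = comb(x_a, e) * q**e * (1 - q)**(x_a - e)
        for i in range(y_a + 1):
            if e - i >= theta:
                total += p_e * comb(y_a, i) * q**i * (1 - q)**(y_a - i)
    return total


def transient_response(q1, theta, x_a, y_a, cycles=20):
    """Activity levels on successive cycles after an initial level q1."""
    qs = [q1]
    for _ in range(cycles):
        qs.append(internal_map(qs[-1], theta, x_a, y_a))
    return qs


def classify(qs, tol=1e-9):
    """Label a sequence 'oscillating' if its successive changes switch sign."""
    signs = [b > a for a, b in zip(qs, qs[1:]) if abs(b - a) > tol]
    changed = any(s != t for s, t in zip(signs, signs[1:]))
    return "oscillating" if changed else "monotone"


excitatory = transient_response(0.5, theta=1, x_a=5, y_a=0)
quenched = transient_response(0.5, theta=6, x_a=5, y_a=0)
print(classify(excitatory), excitatory[-1], quenched[1])
```

With no inhibition and a low threshold the level climbs toward total
activity, and with a threshold that the excitatory inputs can never reach
the activity dies on the first cycle. Intermediate values of `y_a` can be
tried to look for the lower stable levels and damped oscillations described
in the text.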

18.3 $Q_{ij}$ Functions


The function $Q_{i_\mu j_\nu}$ for a binomial-model cross-coupled
perceptron can be calculated by an extension of the treatment employed in the
preceding section. The resulting equation (again assuming large $N_a$) is:


(18.2)


a PCE eh OC a) ee ee Be) 
9 
Q 


i 


Vv Iv 


where ow; = &, + €f-1 -1% +6, + €£-15-12 


ae SS GS ee ee ee 


R 
it 


The above notation for excitatory and inhibitory signal components received
from the "unique" and "common" sets of sensory points and A-units active at
$t-\tau$ is an obvious extension of the notation employed previously (cf.
Chapter 6). For the multinomial probabilities, we have

where $C_s$ = proportion of S-points activated both by $S_{i\mu}$ and $S_{j\nu}$,

$U_{s_i} = R_{i\mu} - C_s$, where $R_{i\mu}$ is the proportion of S-points
activated by $S_{i\mu}$,

$U_{s_j} = R_{j\nu} - C_s$, where $R_{j\nu}$ is the proportion of S-points
activated by $S_{j\nu}$,
$C_a = Q_{i_{\mu-1} j_{\nu-1}}$,
$U_{a_i} = Q_{i_{\mu-1}} - Q_{i_{\mu-1} j_{\nu-1}}$,
$U_{a_j} = Q_{j_{\nu-1}} - Q_{i_{\mu-1} j_{\nu-1}}$.

For arbitrary values of $\mu$ and $\nu$, $Q_{i_\mu j_\nu}$ can again be
calculated by a recursive operation, assuming that the network is "silent"
prior to the start of each sequence. If the two sequences $\mathcal{S}_i$ and
$\mathcal{S}_j$ are incommensurate (or if $\mu \ne \nu$) the values of $C_s$
are thus taken to be zero up to the time that both sequences have begun.
(This is equivalent to extending the shorter sequence by adding a sufficient
number of null stimuli at the beginning to make it equal in length to the
longer sequence.)
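
The alignment convention for sequences of unequal length, and the resulting
step-by-step values of the common retinal area, can be sketched as follows.
Stimuli are represented as frozensets of active S-point indices; that
representation and the names `align`, `common_fractions`, and `NULL` are
assumptions made for the example.

```python
NULL = frozenset()  # the null stimulus: no S-point active


def align(seq_i, seq_j):
    """Prepend null stimuli to the shorter sequence so both end together."""
    n = max(len(seq_i), len(seq_j))
    padded_i = [NULL] * (n - len(seq_i)) + list(seq_i)
    padded_j = [NULL] * (n - len(seq_j)) + list(seq_j)
    return padded_i, padded_j


def common_fractions(seq_i, seq_j, n_s):
    """Fraction of the n_s S-points active in both sequences at each step."""
    padded_i, padded_j = align(seq_i, seq_j)
    return [len(a & b) / n_s for a, b in zip(padded_i, padded_j)]


seq_i = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
seq_j = [frozenset({1, 2}), frozenset({3})]
overlaps = common_fractions(seq_i, seq_j, n_s=4)
print(overlaps)  # [0.0, 0.5, 0.25]
```

The first entry is zero because the shorter sequence has not yet begun at
that step, which is exactly the case covered by the convention above.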


Two questions are of particular importance concerning these
functions. The first is the question of the sensitivity of the system to
perturbations in a sequence of stimuli; this determines how well a
cross-coupled perceptron can discriminate one stimulus sequence from another.
The second question is the dependence of the present state of the system upon
stimuli from the remote past; this is of importance in order to guarantee a
sufficiently consistent response to a present stimulus so that it can be
correctly identified, and also in justifying an approximation to the
perceptron's performance by means of an analysis of finite sequences (as will
be done in the following chapter). Figures 50 and 51 present the results of an
investigation of these questions.*

* The data for these illustrations were computed by W. Eisner, on the
Burroughs 220 computer at Cornell University.

In Figure 50 the effect of a perturbation in the stimulus sequence
is illustrated. In each case the sequence $\mathcal{S}_1$ is assumed to consist
of 17 stimuli ($A_1, A_2, \ldots, A_{17}$). In the other sequences, one or more
"perturbation stimuli" are introduced in place of some of the "A" stimuli;
these are denoted by the letter "B" in the figure. In Figure 50(a), a single
"B" stimulus is introduced, in place of the eighth A stimulus, with $C_{12}$
(the intersection between the "B" stimulus and the corresponding "A"
stimulus, $A_8$) being zero. We find that with $\theta = 2$, $Q_{12}$ is
abruptly reduced as soon as the "B" stimulus occurs, and then approaches a
new asymptotic level, considerably below the $Q_{11}$ level. With a threshold
of 3, however, the curve following the perturbation returns to the $Q_{11}$
level, so that three or four stimuli after the perturbation it is impossible
to tell from the active A-set that the perturbation occurred. If the location
of the "B" stimulus in the sequence is changed, the same type of $Q_{12}$ curve
is found, with the deflection merely being displaced in time, but not changed
in magnitude. Figure 50(b) shows that the same asymptotic level is approached
regardless of the value of $C_{12}$, as long as the "A" and "B" stimuli are not
identical ($C < .2$). In general, it appears that the asymptotic value of
$Q_{12}$ depends on the parameters of the network, but is independent of the
magnitude of the perturbation.

Figure 50(c) shows that as the internal inhibitory component is
increased, the asymptotic value of $Q_{12}$ approaches the asymptotic value of
$Q_{11}$, in much the same manner as when the threshold is increased.
Finally, Figure 50(d) illustrates the effect of increasing the duration of the
perturbation up to four "B" stimuli. Note that the return curve following the
perturbation is practically identical in all cases.


Figure 51 demonstrates the effects of introducing null stimuli
at the beginning of each stimulus sequence, in place of the initial "A"
stimuli. The curves obtained are very similar to those obtained with a
perturbation of the $\mathcal{S}_1$ sequence, and it is again found that by
increasing the threshold or the value of $y_a$ the A-set responding to the
altered sequence can be made to approach the set responding to the original,
unaltered sequence.


These results demonstrate that there are two distinct conditions
which may be found in a cross-coupled perceptron, depending on the choice of
parameters. With small $\theta$, or small values of $y_a$, any perturbation or
variation in the stimulus sequence will cause the system to follow a unique
course for all subsequent time, and the A-set which is active at time $t$
depends on the entire sequence at all times prior to $t$, rather than on the
most recent stimuli. By increasing $\theta$ or $y_a$, however, such a
perceptron can be converted into the second type, in which only the most
recent stimuli appreciably affect the current state of the A-system, and
stimuli which are sufficiently remote in time have a negligible effect. By
lowering $\theta$ or $y_a$ slightly, the duration of the noticeable
aftereffects of a sequence perturbation can be increased, while still
permitting an ultimate return to the A-states associated with the unperturbed
sequence. This means, in effect, that the perceptron has a "short term
memory" for sequences of a length commensurate with the time for the $Q_{12}$
curve to return to its "normal" level, and such sequences can be discriminated
by the system. In discriminating such sequences, the most recent stimuli will
tend to dominate, and differences which occur in the remote past will be
harder to recognize. With the first type of perceptron, however, which is
obtained abruptly when the threshold becomes low enough (or $y_a$ becomes low
enough) even the most remote stimuli have about the same effect as the most
recent stimuli, and the current A-state gives relatively little information
about what the present stimuli actually are. Thus, in order to guarantee an
adequate degree of correlation between the activity state and the current
stimuli, it is necessary to maintain thresholds or inhibitory components at a
sufficiently high level; a perceptron of the first type is unlikely to be of
much practical value.

19. ADAPTIVE PROCESSES IN CLOSED-LOOP CROSS-COUPLED PERCEPTRONS


In Chapter 18, cross-coupled perceptrons with fixed connection
networks were analyzed to determine their stability and characteristic
responses to sequences of stimuli. In earlier chapters, four-layer and
open-loop cross-coupled perceptrons were analyzed to show that an adaptive
preterminal network could vastly improve the capabilities of such systems for
similarity generalization. We now turn to the consideration of cross-coupled
perceptrons with adaptive interconnections between the A-units, and will
attempt to show that the same phenomena can be found here, in a more general
and more efficient form. The cross-coupled system not only recognizes
sequences of stimuli of arbitrary length, but tends to accelerate its adaptation
process due to positive feedback effects within the system. It will be shown
later that the closed-loop cross-coupled system is equivalent to an infinitely
extended open-loop system, analogous to the one described in Chapter 17.


The first attempt to demonstrate similarity generalization in
cross-coupled systems was that of Rosenblatt, in Ref. 85. This was a
partially analytic and partially heuristic argument, based upon a study of the
similarities of origin-point configurations of the A-units under an arbitrary
transformation, T. While the general predictions in this paper were correct,
and have subsequently been demonstrated in simulation experiments, the
method of analysis failed to yield quantitative predictions of the terminal
state of the system, after a prolonged period of pre-conditioning. The
method employed here is basically different, and yields a more general, as
well as more accurate, result. In the following sections, the time-dependent
evolution equations for the cross-coupled system will first be developed in
their most general form, and specific applications will then be made to
systems in which the assumptions and initial conditions are simplified, to
permit a more complete analysis. In the final sections, several similarity
generalization experiments will be presented, and performance will be
compared with that of multi-layer perceptrons.


19.1 Postulated Organization and Dynamics 


The perceptrons to be analyzed in this chapter will be assumed, for 
convenience, to be fully cross-coupled, that is, there is a connection from 
every A-unit to every other A-unit and to itself as well. It can be shown that 
the conclusions which we shall reach for such a system can be extended to any 
perceptron for which the number of cross-coupling connections per A-unit is 


large, and the termini of the connections are assigned at random. 


Connections from S to A-units are assumed to be fixed in value, and
connections from A to R-units are modifiable according to any of the usual
reinforcement rules. (We shall not be concerned here with the reinforcement
of A-R connections, but shall concentrate upon the evolution of the association
network itself.) The A-units are assumed to be simple, with threshold
$\theta$, and output signals $a^* = 1$ or 0. The transmission time for all
connections is a constant $\tau$. Stimuli are assumed to occur at intervals of
the transmission time, $\tau$.


Interconnections among A-units are assumed to be variable,
according to the same rule employed for the four-layer system of Chapter 16;
namely, if $a_i$ is active at time $t$, and $a_j$ is active at time $t+\tau$,
the value of the connection $c_{ij}$ is increased by a quantity
$\eta\,\Delta t$, and at the same time, all values $v_{ij}$ decay by the
quantity $\delta\,\Delta t\,(v_{ij})$. The time unit, $\Delta t$, will generally
be considered large relative to $\tau$. In symbols, we have

$$\Delta v_{ij}(t) = \begin{cases} (\eta - \delta v_{ij})\,\Delta t & \text{if } a_i^*(t-\tau)\,a_j^*(t) = 1 \\ -\delta\,\Delta t\,(v_{ij}) & \text{otherwise} \end{cases} \tag{19.1}$$

Thus the total signal, $\alpha_i(t)$, received by the A-unit $a_i$ at time $t$
consists of a fixed-connection component, $\beta_i(t)$, originating from the
retina, and a variable component, $\gamma_i(t)$, coming from those A-units
which were active at $t-\tau$.
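
The update rule of Equation (19.1) can be written out directly. This is a
minimal sketch: the nested-list layout, in which `v[i][j]` is the value of
the connection from $a_i$ to $a_j$, and the function name are assumptions
made for the example.

```python
def update_couplings(v, prev_active, curr_active, eta, delta, dt):
    """Apply one step of Equation (19.1) to every A-to-A connection.

    v           : v[i][j] is the value of the connection from a_i to a_j
    prev_active : 0/1 outputs of the A-units at time t - tau
    curr_active : 0/1 outputs of the A-units at time t
    A connection gains eta*dt when its origin was active at t - tau and its
    terminus is active at t; every connection decays by delta*dt*v.
    """
    n = len(v)
    new_v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if prev_active[i] and curr_active[j]:
                dv = (eta - delta * v[i][j]) * dt
            else:
                dv = -delta * dt * v[i][j]
            new_v[i][j] = v[i][j] + dv
    return new_v


v0 = [[0.0, 0.0],
      [1.0, 0.0]]
v1 = update_couplings(v0, prev_active=[1, 0], curr_active=[0, 1],
                      eta=1.0, delta=0.1, dt=0.5)
print(v1)
```

In the example only the connection from $a_1$ to $a_2$ is reinforced, while
the existing value on the connection from $a_2$ to $a_1$ decays. A connection
that is reinforced on every step stops changing when
$\eta - \delta v_{ij} = 0$, that is, at $v_{ij} = \eta/\delta$.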


19.2 The Phase Space of the A-units 


Let us suppose that the environment of a cross-coupled perceptron
consists of exactly $n$ admissible stimulus sequences. In order to obtain a
G-matrix for this perceptron, and predict its performance, it is necessary to
know how its A-units will respond to each of the admissible sequences,
including the response to the 1st, 2nd, ..., $m$th member of the sequence. We
will use the notation $a_i^*(S_{j\mu})$ to denote the output signal of the unit
$a_i$ following the $\mu$th stimulus of the sequence $\mathcal{S}_j$. If the
sequence $\mathcal{S}_j$ begins at $t-\mu\tau$, the stimulus $S_{j\mu}$ will
occur at $t-\tau$, and the input to the unit $a_i$ at time $t$ is given by

$$\alpha_i^{(j_\mu)}(t) = \beta_i^{(j_\mu)} + \gamma_i^{(j_\mu)}(t) \tag{19.2}$$

where $\beta_i^{(j_\mu)}$ is the sum of the signals received from the retina
following the occurrence of $S_{j\mu}$ and $\gamma_i^{(j_\mu)}(t)$ is the sum
of the signals received from other A-units at time $t$, given that
$\mathcal{S}_j$ began at $t-\mu\tau$. Knowing $\alpha_i^{(j_\mu)}$, we can
readily determine $a_i^*(S_{j\mu})$, since

$$a_i^*(S_{j\mu}) = \begin{cases} 1 & \text{if } \alpha_i^{(j_\mu)} \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

In the perceptrons to be considered here, $\beta_i^{(j_\mu)}$ is a constant,
while $\gamma_i^{(j_\mu)}(t)$ is a time-dependent variable (as in the
four-layer perceptrons of Chapter 16).


It will be convenient to represent each abbreviated sequence
consisting of the first $\mu$ members of any of the original $n$ sequences by
a full sequence of length $\mu$. If $m$ is the maximum sequence length, this
results in a set of at most $mn$ sequences. Let $N$ be the number of such
sequences, and let them be numbered from $\mathcal{S}_1$ through
$\mathcal{S}_N$. Then in terms of these new sequences, we can obtain all of
the $a_i^*(S_{j\mu}) = a_i^*(\mathcal{S}_k)$, where $\mathcal{S}_k$ is the
sequence corresponding to the first $\mu$ members of the original sequence
$\mathcal{S}_j$. The notation $a_i^*(\mathcal{S}_k)$ means the signal from
$a_i$ following the last member of sequence $\mathcal{S}_k$. Similarly, we have
$\alpha_i^{(k)} = \beta_i^{(k)} + \gamma_i^{(k)}(t)$.

All of the information necessary to predict the response of an
A-unit $a_i$ at time $t$ can now be obtained from the $2N$ numbers
$(\beta_i^{(1)}, \beta_i^{(2)}, \ldots, \beta_i^{(N)}, \gamma_i^{(1)}(t), \ldots, \gamma_i^{(N)}(t)) = (\beta_i, \gamma_i(t))$.
Thus the set of all possible signals (divided into retinal and internal
components) which might affect the activity of $a_i$ at time $t$ can be
represented by a vector of $2N$ components, which depends on $t$. The space of
all such vectors can be mapped into a Euclidean $2N$-space, where each
point represents a possible A-unit, or set of A-units, of the perceptron.
This will be called the phase space of the A-units. For a large, or infinite
perceptron, there is likely to be some concentration of A-units at each
point in this phase space at time $t$. Thus, at time $t$, there is a
probability density associated with each point in the phase space. The state of
the entire association system at a given time, $t$, can then be represented by
a probability density distribution over the phase space of the A-units.

For convenience of notation, parentheses for superscripts of
$\alpha$, $\beta$, and $\gamma$ components will hereafter be omitted, with the
understanding that the symbol $\gamma_i^k$ means the $\gamma$-component for
unit $a_i$ from stimulus sequence $\mathcal{S}_k$. If exponents are required,
they will be expressed by the notation $(\beta_i^k)^n$, which would be
$\beta_i^k$ to the $n$th power. It should be remembered that with the symbols
$\alpha$, $\beta$, and $\gamma$, subscripts always denote A-units, whereas
superscripts indicate stimulus sequences.
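
The renumbering of abbreviated sequences used to build the phase space can
be sketched as follows. Treating identical prefixes of different original
sequences as a single sequence is an assumption of this sketch (it is one
reason the count can fall below $mn$), and the function name and the letter
labels for stimuli are hypothetical.

```python
def prefix_sequences(sequences):
    """Number every distinct prefix of the given stimulus sequences.

    Returns a dict mapping each prefix (as a tuple) to its index k in
    1..N, in order of first appearance.
    """
    index = {}
    for seq in sequences:
        for mu in range(1, len(seq) + 1):
            index.setdefault(tuple(seq[:mu]), len(index) + 1)
    return index


originals = [("A", "B", "C"), ("A", "B", "D")]
numbered = prefix_sequences(originals)
N = len(numbered)
m = max(len(s) for s in originals)
n = len(originals)
print(N, m * n, 2 * N)  # 4 6 8
```

Here two sequences of length 3 yield $N = 4$ distinct prefixes, within the
bound $mn = 6$, so each A-unit is described by a point in a space of
$2N = 8$ dimensions.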
19.3 The Assumption of Finite Sequences 


In analyzing the performance of a perceptron, it will generally 
be our objective to predict the condition of the association system in the limit, 
as the length of the preconditioning sequence becomes infinite. This means 
that there are generally an infinite number of possible sequences in the 
environment, and the phase space of the A-units is properly represented by 
an infinite dimensional Euclidean space. To justify later assumptions, how- 
ever, it is necessary to assume that the preconditioning sequence is actually 
composed of a mixture of a finite number of subsequences of finite length. 
While this assumption will be carried through the analysis of the following 
section, it will be shown later that it is possible to drop the assumption in 


the case of periodic preconditioning sequences. 


Justification for an assumption of finite sequences can be found
in one of two ways. First, we may assume that only the $m$ stimuli prior to
time $t$ can have any appreciable effect on the activity state of the A-system
at time $t$. In this case, we need consider only sequences of length $m$ as
possible determinants of $a_i^*(t)$. Note that this assumption applies only to
the activity state of the system, and not to the values of the connections or
memory state of the network, which clearly depends on all prior time. Such
an assumption appears to be supported by the analyses of the last chapter,
which show that for suitable parameters, only the most recent stimuli affect
the activity state of the system at time $t$, progressively more remote
stimuli making a progressively smaller contribution, which soon becomes
negligible. Specifically, it has been shown that with suitable parameters, it
makes no significant difference to assume that the sequence began at time
$t - m\,\Delta t$, rather than at some earlier time, which is equivalent to the
assumption of a finite universe of sequences of length $m$, in place of the
universe of infinite sequences.


An alternative approach, for which a rigorous analysis rather than
a mere approximation is possible, is the following: Assume that the activity
of the A-units is "quenched" after every $m$ stimuli; i.e., the perceptron is
shown only sequences of length $m$, and at the end of each such sequence, its
activity is interrupted by setting all $a_i^* = 0$, so that the next sequence
begins with the perceptron in a "silent" state, as required. Let us analyze the
performance of such a perceptron (for which the dimension of the phase space
is finite) and then let $m$ approach infinity. The limiting behavior of such a
system should correspond to a perceptron in which the sequences are
uninterrupted. For specificity, and to permit a rigorous analysis, this type of
interrupted-activity system will be assumed in the following analysis, although
it will be shown later that the results can be extended to a more general case.
In keeping with the above assumption, it will be assumed that
there are a total of $N$ possible subsequences which comprise the
preconditioning sequence of the perceptron, symbolized
$\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N$. The phase space therefore
has dimension $2N$, and it is assumed that no stimulus sequence (i.e., no
subsequence) has more than $m$ members (where $m$ is finite). By selecting both
$\eta$ and $\delta$ sufficiently small, it can be guaranteed that the change in
the memory state of the perceptron during a single sequence of length $m$ is
negligible, or infinitesimal, so that the output signal $a_i^*(\mathcal{S}_r)$
depends only on $\mathcal{S}_r$ and the memory state of the system at the start
of the sequence, and does not depend on changes in the memory state which
occurred during the sequence $\mathcal{S}_r$ itself.


19.4 General Analysis: The Time-Dependent Equation


Given the probability density over the phase space of the A-units
at time $t$, it is possible to obtain the Q-functions
$Q_{\mathcal{S}_q \mathcal{S}_r} = Q_{qr}$ for any pair of sequences (of length
$\mu$ and $\nu$, respectively) by integrating the probability density over the
region of phase space for which $a^*(\mathcal{S}_q)\, a^*(\mathcal{S}_r) = 1$.
That is, we integrate over the region for which $\alpha^q \ge \theta$ and
$\alpha^r \ge \theta$. The subscript denoting particular A-units is suppressed
here, since we are concerned only with the density of such A-units, and not
with their individual identity.


The object of a general analysis of the evolution of the association
system in such a perceptron is to describe the "flow" of A-units in this
phase space, so as to obtain the density function at time $t$ as a function of
the initial distribution and the stimulus sequences to which the perceptron
has been exposed. The system can be represented by a sort of hydrodynamic
model; the probability density in the phase space is treated as a sort of
compressible fluid, in which convection phenomena occur, but in which
there is no diffusion, since it will be seen that the A-units which initially
occupy a given point in phase space will always move together, in unison,
rather than following unique paths. Throughout this analysis, it will be assumed
that we are dealing with finite stimulus sequences (as described in Section 19.3),
and that the rate of flow (the length of the velocity vector) for all points in
the phase space is infinitesimal over the duration of the longest sequence.
The history of the perceptron, then, consists of an endless sequence of such
finite sub-sequences, so that at a given point in time, the perceptron can be
assumed to be exposed to a mixture of all possible sequences, each weighted
according to its probability. The velocity vector for a given point in phase
space at time $t$ then depends on the combination of velocity components
contributed by each of the stimulus sequences to which the perceptron is
exposed.


We have seen that each A-unit, $a_i$, is characterized by a set
of coordinates in phase space at time $t$, namely
$(\beta_i^1, \beta_i^2, \ldots, \beta_i^N, \gamma_i^1, \gamma_i^2, \ldots, \gamma_i^N)$.
For the given A-unit, the $\beta$-components are fixed for all
time, while the $\gamma$-components depend on $t$. Thus, to follow the history
of this A-unit (or point in phase space) we must determine the velocity
vector $\dot{\gamma}_i = (\dot{\gamma}_i^1, \dot{\gamma}_i^2, \ldots, \dot{\gamma}_i^N)$
as a function of time for the point $(\beta_i, \gamma_i)$.


We consider first the effect of the reinforcement which occurs
for the last stimulus in a sequence $\mathcal{S}_q$ upon the component
$\gamma^r$. To be specific, suppose sequence $\mathcal{S}_q$ occurs at time $t$,
and $\mathcal{S}_r$ occurs at $t + \Delta t$, and assume the transmission time
$\tau \ll \Delta t$. Then the (infinitesimal) change in $\gamma^r$ due to having
reinforced the last stimulus in sequence $\mathcal{S}_q$ at time $t$ will be
denoted by $\Delta_q(\beta_i, \gamma_i(t))$. It is a function of the location of
the point in phase space whose motion is being traced, at time $t$. Note that
although only the effect due to the last stimulus of the sequence
$\mathcal{S}_q$ is considered, all abbreviated sequences are present among the
$N$ possible sequences, so that if we know the effect of reinforcing the
terminal stimulus in each case, the effect of all possible reinforcements can
be calculated.


A notation for the sequence corresponding to $\mathcal{S}_r$ with its
terminal member omitted (i.e., the sequence $\mathcal{S}_r$ abbreviated by one
stimulus) will be required. We shall use the symbol $\mathcal{S}_{r'}$ to denote
such an abbreviated sequence. The change in the memory state due to the last
stimulus of sequence $\mathcal{S}_q$ is then attributable to the modification of
the values of those connections which originate in the set of A-units which
respond to $\mathcal{S}_{q'}$, and which terminate in the set of A-units
responding to $\mathcal{S}_q$. From equation (19.1) we see that each such
connection gains a quantity of value $(\eta - \delta v)\,\Delta t$, while all
other connections lose a quantity $\delta v\,\Delta t$.


[Figure 52. Effect of reinforcing sequence $\mathcal{S}_q$ upon $\gamma^r$.
Panels: A-sets responding to abbreviated sequences; A-sets responding to full
sequences.]

Figure 52 illustrates the relationship of the A-unit sets which
are involved in this transaction, and shows the increments to $\gamma^r$ which
result from the occurrence of $\mathcal{S}_q$ at time $t$. The sets responding
at time $t$ and $t - \Delta t$ are designated $A_q(t)$ and $A_{q'}(t)$,
respectively. The set $A_{r'}(t + \Delta t)$ is the set responding to the
preterminal stimulus of sequence $\mathcal{S}_r$. The measures of these sets are
$Q_q(t)$, $Q_{q'}(t)$ and $Q_{r'}(t + \Delta t)$. Since it was assumed that all
A-units are interconnected, the measure of the set of connections for which
$\Delta v = (\eta - \delta v)\,\Delta t$ is $Q_{q'}(t)$ for $a_i \in A_q(t)$,
and the measure of the set of connections for which
$\Delta v = -\delta v\,\Delta t$ is $1 - Q_{q'}(t)$. For $a_i \notin A_q(t)$,
all of its input connections lose $\delta v\,\Delta t$. But we are particularly
interested in the change in $\gamma_i^r$, which is the sum of the changes of
value for all connections originating in the set $A_{r'}(t + \Delta t)$, and
terminating on the arbitrary unit $a_i$, whose coordinates are
$(\beta_i, \gamma_i)$. These connections can be divided into three subsets:


(1) Connections which originate from the intersection
$A_{q'}(t) \cap A_{r'}(t + \Delta t)$ and terminate in $A_q(t)$ change by
$(\eta - \delta v)\,\Delta t$.

(2) Connections which originate from the set
$A_{r'}(t + \Delta t) - \{A_{q'}(t) \cap A_{r'}(t + \Delta t)\}$ and terminate
in $A_q(t)$ change by $-\delta v\,\Delta t$.

(3) All connections which originate from the set $A_{r'}(t + \Delta t)$
and terminate outside of $A_q(t)$ change by $-\delta v\,\Delta t$.


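Now let us consider the difference equation

$$\Delta_q(\beta_i, \gamma_i(t)) = \gamma_i^r(t + \Delta t) - \gamma_i^r(t) \tag{19.3}$$

for the A-unit $a_i$ whose location at time $t$ is $(\beta_i, \gamma_i(t))$.
Since $\gamma_i^r = \sum_{a_j \in A_{r'}} v_{ji}$, we can make the substitutions:

$$\gamma_i^r(t) = \sum_{a_j \in A_{r'}(t)} v_{ji}(t)$$

$$\gamma_i^r(t + \Delta t) = \sum_{a_j \in A_{r'}(t + \Delta t)} v_{ji}(t + \Delta t)
= \sum_{a_j \in A_{r'}(t)} v_{ji}(t) + \sum_{a_j \in A_{r'}(t)} \Delta v_{ji}
+ \sum_{a_j \in \{A_{r'}(t + \Delta t) - A_{r'}(t)\}} v_{ji}(t + \Delta t)
- \sum_{a_j \in \{A_{r'}(t) - A_{r'}(t + \Delta t)\}} v_{ji}(t + \Delta t)$$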
Making these substitutions yields:

$$\Delta_q(\beta_i, \gamma_i(t)) = \sum_{a_j \in A_{r'}(t)} \Delta v_{ji}(t)
+ \sum_{a_j \in \Delta A_{r'}} \pm\, v_{ji}(t + \Delta t) \tag{19.4}$$

where $\Delta A_{r'} = \{A_{r'}(t + \Delta t) - A_{r'}(t)\} + \{A_{r'}(t) - A_{r'}(t + \Delta t)\}$,
that is, the set of A-units added or subtracted from the set $A_{r'}(t)$ during
the period $\Delta t$. The first sum represents the change of value of the set
of connections which originate in $A_{r'}(t)$ and are reinforced at time $t$ due
to sequence $\mathcal{S}_q$. This change in value is readily obtained from the
components listed above, and is given by

$$\sum_{a_j \in A_{r'}(t)} \Delta v_{ji}(t) =
\begin{cases}
\Bigl[ N_a \eta\, Q_{q'r'}(t) - \delta \sum_{a_j \in A_{r'}(t)} v_{ji}(t) \Bigr] \Delta t & \text{for } a_i \in A_q(t) \\
-\delta\, \Delta t \sum_{a_j \in A_{r'}(t)} v_{ji}(t) & \text{for } a_i \notin A_q(t)
\end{cases}$$

which may be combined in the form

$$\sum_{a_j \in A_{r'}(t)} \Delta v_{ji}(t) =
\Bigl[ N_a \eta\, Q_{q'r'}(t)\, \phi(\beta_i^q + \gamma_i^q(t)) - \delta \gamma_i^r(t) \Bigr] \Delta t \tag{19.5}$$

where, as before, $\phi(\alpha) = 1$ for $\alpha \ge \theta$, and 0 otherwise,
and $\gamma_i^r(t)$ has been substituted for $\sum_{a_j \in A_{r'}(t)} v_{ji}(t)$.
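The step from the two cases to the combined form (19.5) can be spelled out as a short check. This is a sketch under the reading adopted here, not a quotation of the original derivation: it assumes that $N_a Q_{q'r'}(t)$ counts the connections originating in $A_{q'}(t) \cap A_{r'}(t)$, that each of them gains $\eta\,\Delta t$ only when the receiving unit $a_i$ is active for $\mathcal{S}_q$, and that every connection decays by $\delta v_{ji}\,\Delta t$.

```latex
\begin{align*}
\sum_{a_j \in A_{r'}(t)} \Delta v_{ji}(t)
  &= \underbrace{\eta\,\Delta t \cdot N_a Q_{q'r'}(t)\,\phi(\alpha_i^q(t))}_{\text{gain, present only if } a_i \in A_q(t)}
   \;-\; \underbrace{\delta\,\Delta t \sum_{a_j \in A_{r'}(t)} v_{ji}(t)}_{\text{decay, present in both cases}} \\
  &= \bigl[\, N_a \eta\, Q_{q'r'}(t)\,\phi(\beta_i^q + \gamma_i^q(t)) - \delta\,\gamma_i^r(t) \,\bigr]\,\Delta t ,
  \qquad \alpha_i^q(t) = \beta_i^q + \gamma_i^q(t).
\end{align*}
```

Setting $\phi = 1$ recovers the case $a_i \in A_q(t)$, and setting $\phi = 0$ recovers the pure-decay case $a_i \notin A_q(t)$.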


The second sum in (19.4) represents the value of the set of
connections which originate from the incremental set, $\Delta A_{r'}$. For this
sum, it will be convenient to substitute the symbol $\Delta_q^* \gamma_i^r(t)$.
Thus, (19.4) becomes

$$\Delta_q(\beta_i, \gamma_i(t)) =
\Bigl[ N_a \eta\, Q_{q'r'}(t)\, \phi(\beta_i^q + \gamma_i^q(t)) - \delta \gamma_i^r(t) \Bigr] \Delta t
+ \Delta_q^* \gamma_i^r(t) \tag{19.6}$$

where the subscript $i$ indicates that the subscripted variable is a component
of the vector $(\beta, \gamma)$ for the unit $a_i$.
Now suppose each possible "conditioning sequence", $\mathcal{S}_q$,
occurs with a probability $P_q$, and that a statistically uniform mixture of
all such sequences occurs at time $t$. This supposition is justified by
our assumption that the length of each sequence is infinitesimal, relative
to the rate of change in the memory-state of the perceptron. In that case,
we obtain from (19.6)

$$\Delta(\beta_i, \gamma_i(t)) = \sum_q P_q\, \Delta_q(\beta_i, \gamma_i(t))
= N_a \eta\, \Delta t \sum_q P_q\, Q_{q'r'}(t)\, \phi(\beta_i^q + \gamma_i^q(t))
- \delta\, \Delta t\, \gamma_i^r(t) + \Delta^* \gamma_i^r(t) \tag{19.7}$$

where $\Delta^* \gamma_i^r(t)$ = value added or subtracted due to connections
originating from the combined incremental set due to all $\mathcal{S}_q$. If we
now divide both sides by $\Delta t$ and allow $\Delta t$ to approach zero, we
obtain the differential equation for the velocity component $\dot{\gamma}_i^r(t)$
for the unit $a_i$:

$$\frac{d\gamma_i^r(t)}{dt} = N_a \eta \sum_q P_q\, Q_{q'r'}(t)\, \phi(\beta_i^q + \gamma_i^q(t))
- \delta \gamma_i^r(t) + \frac{d^* \gamma_i^r(t)}{dt} \tag{19.8}$$

where

$$\frac{d^* \gamma_i^r(t)}{dt} = \lim_{\Delta t \to 0} \frac{\Delta^* \gamma_i^r(t)}{\Delta t}\;{}^{*}$$


Note that the quantity $\Delta^* \gamma_i^r(t)$ is zero except at those times
that new A-units are added to the set $A_{r'}(t)$, since it represents the sum
of the values in the incremental set $\Delta A_{r'}$. Again, we note that for
sequences of length 2 or less, the set $A_{r'}(t)$ never changes, since new
units can be added to the set only if $\phi(\alpha^{r'})$ changes from 0 to 1,
and for sequences of length 2, $\phi(\alpha^{r'}) = \phi(\beta^{r'})$, which is
constant. Similarly, for sequences of length 2 or less, $Q_{q'r'}(t)$ is
constant. Consequently, for these conditions, the equation (19.8) is equivalent
to (16.11), except that $Q_{q'r'}$ takes the place of $Q_{q'r}$. In the general
case, however, $d^* \gamma_i^r(t)/dt$ is not always zero; at those times that
new A-units are added to the set $A_{r'}$, an unknown increment to the value of
$\gamma^r$ occurs, which depends upon the values of the connections from those
units whose $\alpha^{r'}$ has just become equal to $\theta$. This quantity is
exceedingly difficult to calculate, as it depends upon detailed correlation of
the $\beta$-vectors for the new transmitting units and the $\beta$-vector for
the receiving unit, $a_i$. Fortunately, it can be shown that the steady-state
solution to (19.8) does not depend upon the actual value of the last term, even
though it affects the rate of convergence to the steady-state condition.

* Strictly speaking, this is either zero, or fails to exist. However, this
expression will be restated below in terms of delta-functions (see
Equation 19.9).


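In the general case, the solution of (19.8) is discontinuous, unlike
the solution of (16.11), which was always continuous despite its discontinuous
derivative. From the above discussion as to the nature of
$d^* \gamma_i^r(t)/dt$, it becomes clear that (19.8) can be rewritten in terms
of Dirac delta-functions:

$$\frac{d\gamma_i^r(t)}{dt} = N_a \eta \sum_q P_q\, Q_{q'r'}(t)\, \phi(\beta_i^q + \gamma_i^q(t))
- \delta \gamma_i^r(t) + \sum_k \delta(t - t_k)\, \Delta^* \gamma_i^r(t_k) \tag{19.9}$$

where $t_k$ is any time at which one or more of the $\phi(\alpha^{r'})$ changes
from 0 to 1, or vice versa. (In the last term, $\delta(t - t_k)$ with an
argument denotes the delta-function, as distinct from the decay constant
$\delta$.)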
19.5 Steady State Solutions


Consider the equilibrium equation corresponding to (19.9). If
an equilibrium exists at time $t$, then no $\phi(\alpha)$ can change its value
at time $t$, and thus the last term of (19.9) is zero at this time. Thus, a
steady state solution must correspond to a solution to the equation

$$N_a \eta \sum_q P_q\, Q_{q'r'}(\infty)\, \phi(\beta_i^q + \gamma_i^q(\infty)) - \delta \gamma_i^r(\infty) = 0 \tag{19.10}$$

which gives

$$\gamma_i^r(\infty) = \frac{N_a \eta}{\delta} \sum_q P_q\, Q_{q'r'}(\infty)\, \phi(\beta_i^q + \gamma_i^q(\infty)) \tag{19.11}$$

or, substituting for $Q_{q'r'}$,

$$\gamma_i^r(\infty) = \frac{N_a \eta}{\delta} \sum_q \sum_{(\beta, \gamma_\infty)}
P_q\, \phi(\beta_i^q + \gamma_i^q(\infty))\, P(\beta, \gamma_\infty)\,
\phi(\beta^{q'} + \gamma_\infty^{q'})\, \phi(\beta^{r'} + \gamma_\infty^{r'}) \tag{19.12}$$

Note that the terminal vector $(\beta, \gamma_\infty)$ of an A-unit (in a given
system) depends only on the starting vector $(\beta, \gamma_0)$, so that we can
also write, in place of (19.12),

$$\gamma_i^r(\infty) = \frac{N_a \eta}{\delta} \sum_q P_q \Bigl[ \phi(\beta_i^q + \gamma_i^q(\infty))
\sum_{(\beta, \gamma_0)} P(\beta, \gamma_0)\, \phi(\beta^{q'} + \gamma^{q'}(\infty))\,
\phi(\beta^{r'} + \gamma^{r'}(\infty)) \Bigr] \tag{19.13}$$

where $P(\beta, \gamma_0)$ is the probability that an A-unit is initially
situated at the point $(\beta, \gamma_0)$ in the phase space. Thus, in this
form, the steady-state solution requires no knowledge of the individual A-units
and their connections, but depends only on the initial point-mass distribution
over the phase space. The corresponding time-dependent differential equation
represents the velocity vector for an element of probability-mass in this
phase space.


Now a possible solution of (19.13) can be found by the following
iterative procedure: Assume that initially, the values of all A-A connections
are zero, so that $\gamma_0 = 0$ for all units, and (19.13) depends only on the
$\beta$-vectors. Begin by inserting $\gamma_\infty = 0$ for all $\gamma$'s on
the right-hand side of (19.13), and compute the resulting approximation for
$\gamma_\infty^r(\beta, \gamma_0)$, for all possible $\beta$-vectors (or for all
units, $a_i$). The first approximation for $\gamma_\infty$ is then inserted on
the right-hand side, to obtain the next approximation, etc. If we let
$\gamma_{(n)}$ represent the result of the $n$-th iteration, we have

$$\gamma_{i(n+1)}^r = \frac{N_a \eta}{\delta} \sum_q P_q \Bigl[ \phi(\beta_i^q + \gamma_{i(n)}^q)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^{q'} + \gamma_{j(n)}^{q'})\,
\phi(\beta_j^{r'} + \gamma_{j(n)}^{r'}) \Bigr] \tag{19.14}$$

We will now attempt to show that this iteration must converge in a finite
number of steps to the solution of the differential equation (19.9), for
equivalent initial conditions.
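The iterative procedure can be sketched numerically. The following is a minimal illustration, not the author's program: the function names, the array layout, and the `prev` list (which maps each sequence to its abbreviated sequence) are all hypothetical, and it is assumed here that a one-stimulus sequence has an empty abbreviated sequence for which no unit is active, as in the interrupted-activity model.

```python
import numpy as np

def phi(alpha, theta):
    """Threshold function: 1.0 where alpha >= theta, else 0.0."""
    return (np.asarray(alpha) >= theta).astype(float)

def iterate_steady_state(beta, p_beta, p_seq, prev, c, theta, max_iter=10_000):
    """Fixed-point iteration in the spirit of (19.14).

    beta   : (K, N) array; beta[j, q] is the retinal signal of unit type j
             for sequence q
    p_beta : (K,) probabilities of the K beta-vectors
    p_seq  : (N,) probabilities P_q of the N sequences
    prev   : length-N list; prev[q] is the index of the abbreviated sequence
             q', or None for a one-stimulus sequence (assumption: nothing is
             active before the first stimulus)
    c      : the combined constant N_a * eta / delta
    Returns (gamma, number of updates performed).
    """
    beta = np.asarray(beta, dtype=float)
    p_beta = np.asarray(p_beta, dtype=float)
    p_seq = np.asarray(p_seq, dtype=float)
    gamma = np.zeros_like(beta)          # all A-A connections start at zero
    last_act = None
    for n in range(max_iter):
        act = phi(beta + gamma, theta)   # act[j, q] = phi(alpha_j^q)
        if last_act is not None and np.array_equal(act, last_act):
            return gamma, n              # no phi changed: fixed point reached
        # Activity for the abbreviated sequences q'
        act_prev = np.zeros_like(act)
        for q, qp in enumerate(prev):
            if qp is not None:
                act_prev[:, q] = act[:, qp]
        # q_mat[q, r] = sum_j P(beta_j) phi(alpha_j^{q'}) phi(alpha_j^{r'})
        q_mat = act_prev.T @ (p_beta[:, None] * act_prev)
        # gamma[i, r] = c * sum_q P_q phi(alpha_i^q) q_mat[q, r]
        gamma = c * (act * p_seq) @ q_mat
        last_act = act
    raise RuntimeError("iteration did not settle within max_iter steps")

# Toy case: one unit type, three nested sequences, sub-threshold on the third.
gamma, steps = iterate_steady_state(
    beta=[[2, 2, 1]], p_beta=[1.0], p_seq=[1/3, 1/3, 1/3],
    prev=[None, 0, 1], c=10.0, theta=2.0)
```

In the toy case the unit is initially silent for the third sequence, becomes active for it after the first update, and the activity pattern then stops changing, which is the monotone behavior the convergence argument relies on.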


We first show that the iteration process itself converges in less
than $N N_\beta$ steps (where $N$ = the number of stimulus sequences, and
$N_\beta$ = the number of $\beta$-vectors for which $P(\beta) > 0$). On the
first iteration, it is clear that the $\gamma$'s can only increase, since they
start out from zero, and are set equal to a non-negative quantity. But
introducing this quantity for the next iteration can only increase the $\phi$'s
from zero to 1; it cannot cause any $\phi$ to decrease. Consequently, on the
next iteration, the $\gamma$'s can again only increase, and similarly for each
subsequent iteration. Since $\gamma^r$ is non-decreasing,
$\phi(\beta^r + \gamma^r)$ is non-decreasing, for all $r$. But $\gamma^r$ can
change only when some $\phi$ changes, and each $\phi$ can change at most once
(from 0 to 1). But there are at most $N_\beta N$ $\phi$-functions,
$\phi(\alpha^r)$. If all of these are initially zero, the system is already at
a solution, and no further changes will occur. Therefore, at most
$n < N_\beta N$ $\phi$-functions can change, and the process must converge in
less than $N_\beta N$ iterations.

Let the end result of this process be $\gamma_i^*$ for any unit $a_i$. We
now wish to prove that $\gamma^*$ is a solution of the differential equation
(19.9).

To begin with, we prove that $\gamma^*$ is a minimal solution of the
equilibrium equation (19.13).


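Let $\bar{\gamma}^r$ be any solution of the equilibrium equation. Then for
the iteration process, we have $\gamma_{i(0)}^r \le \bar{\gamma}_i^r$ for all
$r$ and all $\beta_i$. Since the right-hand side of (19.13) is a monotone
non-decreasing function of $\gamma$, we have

$$\gamma_{i(1)}^r = \frac{N_a \eta}{\delta} \sum_q P_q \Bigl[ \phi(\beta_i^q + \gamma_{i(0)}^q)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^{q'} + \gamma_{j(0)}^{q'})\, \phi(\beta_j^{r'} + \gamma_{j(0)}^{r'}) \Bigr]$$

$$\le \frac{N_a \eta}{\delta} \sum_q P_q \Bigl[ \phi(\beta_i^q + \bar{\gamma}_i^q)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^{q'} + \bar{\gamma}_j^{q'})\, \phi(\beta_j^{r'} + \bar{\gamma}_j^{r'}) \Bigr]
= \bar{\gamma}_i^r$$

Similarly, $\gamma_{(2)} \le \bar{\gamma}$, and hence
$\gamma_{(n)} \le \bar{\gamma}$ for every $n$, so that
$\gamma^* \le \bar{\gamma}$. Hence $\gamma^*$ is minimal.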


Now consider the differential equation (19.9). As long as no $\phi$
changes value, all $\delta$-functions are zero, and (19.9) simplifies to

$$\frac{d\gamma_i^r}{dt} = N_a \eta \sum_q P_q\, Q_{q'r'}(t)\, \phi(\alpha_i^q(t)) - \delta \gamma_i^r(t)
= N_a \eta \sum_q \Bigl[ P_q\, \phi(\alpha_i^q(t)) \sum_{\beta_j} P(\beta_j)\,
\phi(\alpha_j^{q'}(t))\, \phi(\alpha_j^{r'}(t)) \Bigr] - \delta \gamma_i^r(t)$$

where $\alpha_i(t) = \beta_i + \gamma_i(t)$. Thus, while the $\phi$'s are
constant, the differential equation is of the form
$d\gamma/dt = M - \delta\gamma$, where

$$M = N_a \eta \sum_q \Bigl[ P_q\, \phi(\alpha_i^q) \sum_{\beta_j} P(\beta_j)\,
\phi(\alpha_j^{q'})\, \phi(\alpha_j^{r'}) \Bigr]$$

Thus, during this time, there is an exponential approach to the limit
$M/\delta$, analogous to the solution discussed in Chapter 16 (pg. 355). Now
suppose at time $t_1$ one of the $\phi$'s changes. At this point, the last term
in (19.9) is infinite, and the solution is discontinuous, since the value of
the connections from the incremental set $\Delta A_{r'}$ has just been added to
$\gamma^r$. Consequently, the solution takes the form shown in Figure 53.
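The piecewise behavior described here (exponential relaxation toward $M/\delta$ between discontinuities, with an instantaneous increment at each time a $\phi$ changes) can be illustrated with a short sketch. The function names and the `(m, duration, jump)` segment format are hypothetical conveniences, and the jump sizes are simply supplied by the caller, since the text notes that their actual magnitude is unknown.

```python
import math

def relax(gamma0, m, delta, dt):
    """Closed-form solution of d(gamma)/dt = m - delta*gamma after a time dt,
    starting from gamma0: an exponential approach to the limit m/delta."""
    limit = m / delta
    return limit + (gamma0 - limit) * math.exp(-delta * dt)

def piecewise_solution(gamma0, segments, delta):
    """Follow gamma through successive segments.

    segments is a list of (m, duration, jump) tuples: within a segment the
    driving term m is constant, and `jump` is the non-negative increment
    added at the discontinuity that ends the segment (0.0 if none).
    """
    gamma = gamma0
    for m, duration, jump in segments:
        gamma = relax(gamma, m, delta, duration) + jump
    return gamma

# Starting from zero, relax toward M1/delta = 1, jump by 0.5, then relax
# toward the higher level M2/delta = 3.
final = piecewise_solution(0.0, [(1.0, 1000.0, 0.5), (3.0, 1000.0, 0.0)], delta=1.0)
```

Because each new level is at least as high as the previous one and each jump is non-negative, a trajectory built this way never decreases, in line with the monotonicity claimed for the solution.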
Figure 53 FORM OF SOLUTION FOR CROSS-COUPLED $\alpha$-SYSTEMS


M=No1 D>, i 6 (M6) DY PB) Od 7) $e), 


a & 
“feet aff + [BQO = 2 MEDD oar green 


The middle term of this expression represents the value of $\gamma_i^r$ at time
$t_1 - dt$, just prior to the discontinuity. The magnitude of
$\Delta^* \gamma$ remains unknown, but we know that it must be non-negative,
since it consists of values of A-unit interconnections which began at zero and
can only have changed in a positive direction. As in the case of the iterative
process, there are at most $N_\beta N$ times, $t_k$, at which these
discontinuities can occur, and each new limit satisfies
$M_{k+1}/\delta \ge M_k/\delta$. Moreover, the solution remains monotone
increasing, despite its discontinuities. This last conclusion can be seen from
the fact that the increment $\Delta^* \gamma$ comes from the values of a set
of connections whose origins are now active for one more stimulus sequence
than previously. Since no previously active A-units have become inactive (all
$\phi$'s being monotone increasing) the values of these connections will not
diminish, and will, in fact, tend to increase. Thus the new limit for
$\gamma_i$ can be no lower than its present value.


Now consider the first step of the iterative process. This yields
for $\gamma_{(1)}^r$ the value of the first asymptotic level, $M_1/\delta$, for
all $\gamma^r$ in the differential equation. This means that if any $\phi$
changes in the differential equation prior to reaching the level $M_1/\delta$,
this $\phi$ must also change in the first step of the iterative process. (If no
$\phi$ changes prior to the level $M_1/\delta$ then no $\phi$ will ever change,
and we are at a solution for both equations.) But the new level, $M_2/\delta$,
is a positive monotonic function of the $\phi$'s, and the next step of the
iteration process, $\gamma_{(2)}$, corresponds to the level $M_2/\delta$
which would have resulted had every $\gamma$ actually attained its asymptotic
level $M_1/\delta$. Thus $\gamma_{(2)}^r \ge \gamma^r(t_1)$ for every $r$. But
from the same argument, it follows that $\gamma_{(3)} \ge \gamma(t_2)$, and in
general, $\gamma_{(n+1)} \ge \gamma(t_n)$. Consequently, since $\gamma^*$ is
minimal among equilibrium solutions, $\gamma^* = \gamma(\infty)$, and the
solutions of the two equations are indeed identical.


* It is assumed that M is not identically equal to © , in which case the 
solutions might coincide only for t = ow 
19.6 Analysis of Finite-Sequence Environments


The term "finite-sequence environment" will be used for any
system in which the stream of activity is periodically interrupted, either by
actively setting all $a_i^*$ to zero, or by introducing sequences of null
stimuli of sufficient duration to allow all A-unit activity to die out of its
own accord. The latter possibility exists only for systems in which the
internal connection values are sufficiently small, or contain a sufficient
inhibitory component, to guarantee that activity will, in fact, die away. Some
idea of the conditions for this to occur may be gained from Section 18.2, and
Figure 47. For convenience (and because it can always be realized, regardless
of choice of parameters) the interrupted activity model will be considered
here. In either case, finite-sequence environments are directly analyzable by
the method of Section 19.5. Several examples are given here, based on the same
stimulus environment as in Experiment 12. It will be recalled that this
consisted of four stimuli, with areas $R = .2$, and intersections $C_{13}$ and
$C_{24} = .1$, all other intersections being zero. As in the example in
Chapter 17, we will consider a binomial perceptron with parameters $x = 3$,
$y = 0$, and $\theta = 2$, for all A-units.


EXAMPLE 1: Suppose the preconditioning sequence consists of an endless
repetition of the subsequence: $S_1 S_2 S_3 S_4 / S_1 S_2 S_3 S_4 / S_1 S_2 S_3 S_4 / \ldots$, where
the symbol / is used to indicate points at which activity is interrupted. Then
for this environment there are actually four possible sequences to be considered
in the analysis, namely

$\mathcal{S}_1 = (S_1)$

$\mathcal{S}_2 = (S_1 S_2)$

$\mathcal{S}_3 = (S_1 S_2 S_3)$

$\mathcal{S}_4 = (S_1 S_2 S_3 S_4)$

each occurring with probability $P = .25$. The $\beta$-vectors for these four
sequences correspond to the signals received from the terminal stimulus in
each sequence, and are listed in Table 9, together with their probabilities.
$\beta$-vectors consisting only of 1's and 0's represent A-units which will
always remain inactive.
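The bookkeeping for an interrupted environment of this kind is simple enough to sketch. The helper below is illustrative only (its name and return format are hypothetical): it lists the prefixes of the repeated subsequence, records for each one the index of its abbreviated sequence, and assigns the uniform probability used in the example.

```python
def prefix_sequences(base):
    """Enumerate the sequences of an interrupted (quenched) environment.

    base is the subsequence repeated between interruptions. Returns
      seqs : every prefix of base, shortest first
      prev : for each prefix, the index of the prefix one stimulus shorter
             (None for the one-stimulus prefix, which has no predecessor)
      prob : the common probability 1/len(base) of each prefix
    """
    seqs = [tuple(base[:k]) for k in range(1, len(base) + 1)]
    prev = [None] + list(range(len(base) - 1))
    prob = 1.0 / len(base)
    return seqs, prev, prob

# The four-stimulus subsequence of Example 1.
seqs, prev, prob = prefix_sequences(["S1", "S2", "S3", "S4"])
```

Each prefix is identified with its terminal stimulus for the purpose of looking up $\beta$-vectors, while the `prev` list supplies the primed (abbreviated) sequences that appear in the steady-state equation.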


The initial Q-matrix for this experiment is precisely the same
as that found for the corresponding terminal stimuli in Chapter 17, namely,

$Q(0)$ =

.104 .000 .034 .000
.000 .104 .000 .034
.034 .000 .104 .000
.000 .034 .000 .104

It is found that no change occurs in this matrix for
$N_a \eta/\delta \le 117.6$. Let us therefore consider the case in which
$N_a \eta/\delta = 160$. In the open-loop system of Chapter 17, the sequence of
Experiment 12 yielded the terminal Q-matrix:

$Q(\infty)$ =

.210 .176 .034 .072
.176 .210 .072 .034
.034 .072 .210 .176
.072 .034 .176 .210

If we now compute the terminal matrix for a fully cross-coupled system, from
Equation (19.14), we obtain:

$Q(\infty)$ =

.104 .000 .034 .000
.000 .152 .000 .130
.034 .000 .104 .000
.000 .130 .000 .152




TABLE 9 
$\beta$-VECTORS FOR STIMULI OF EXPERIMENT 12


(Parameters of A-units: x = 2, y = 0) 


4 PA) A 
0000 064 3020 
000! .088 0012 
0010 048 0021 
0100 048 0120 
1000 048 0210 
0011 028 1200 
0110 02% 2100 
1001 024 1002 
1100 .02§ 2001 
O10! 072 0103 
1010 072 0301 
Olli .030 1030 
1001 -030 3010 
110) .030 0212 
1810 -030 2120 
one 086 0121 

1210 
0003 001 2021 
0030 00! 1202 
0300 001 1012 
3000 00) 2101 
0303 -001 1212 
3030 -001 2121 
0002 012 WIT 
0020 012 1121 
0200 012 1241 
2000 012 TIT 
0202 018 
2020 -018 pe 
0201 .027 1021 
0102 -027 2011 
2010 -027 1102 
1020 .027 1201 
0203 -003 1120 
ee soe 2110 
2030 -003 
The only change which occurs in this case is that the set $A_2$ gains a larger
intersection with the set $A_4$. There is no tendency here for the A-sets
responding to adjacent pairs of stimuli to merge, as would be the case in a
four-layer model, or an open-loop cross-coupled network with zero transmission
times. This is shown even more strikingly in the following example.


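EXAMPLE 2: For the same parameters as Example 1, let us extend the basic
subsequence to 8 stimuli, using as the preconditioning sequence:

$S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4 / S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4 / \ldots$

The sequences for this environment are now

$\mathcal{S}_1 = (S_1)$                      $\mathcal{S}_5 = (S_1 S_2 S_1 S_2 S_3)$
$\mathcal{S}_2 = (S_1 S_2)$                  $\mathcal{S}_6 = (S_1 S_2 S_1 S_2 S_3 S_4)$
$\mathcal{S}_3 = (S_1 S_2 S_1)$              $\mathcal{S}_7 = (S_1 S_2 S_1 S_2 S_3 S_4 S_3)$
$\mathcal{S}_4 = (S_1 S_2 S_1 S_2)$          $\mathcal{S}_8 = (S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4)$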


Each sequence occurs with probability $P_q = .125$. The initial Q-matrix
again depends only on the terminal stimuli, and takes the form:

$Q(0)$ =

.104 .000 .104 .000 .034 .000 .034 .000
.000 .104 .000 .104 .000 .034 .000 .034
.104 .000 .104 .000 .034 .000 .034 .000
.000 .104 .000 .104 .000 .034 .000 .034
.034 .000 .034 .000 .104 .000 .104 .000
.000 .034 .000 .034 .000 .104 .000 .104
.034 .000 .034 .000 .104 .000 .104 .000
.000 .034 .000 .034 .000 .104 .000 .104
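Since the initial Q-matrix depends only on the terminal stimulus of each sequence, it can be assembled by indexing the four-stimulus matrix of Example 1. The sketch below assumes the terminal stimuli of the eight sequences run $S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4$, an ordering inferred from the pattern of entries in the printed matrix; the function and variable names are illustrative.

```python
import numpy as np

# Initial Q-matrix for the four stimuli, as printed for Example 1.
Q_STIM = np.array([
    [.104, .000, .034, .000],
    [.000, .104, .000, .034],
    [.034, .000, .104, .000],
    [.000, .034, .000, .104],
])

def initial_q(q_stim, terminal):
    """Initial sequence-level Q-matrix.

    terminal[k] is the (0-based) index of the stimulus that ends sequence k.
    Entry (q, r) of the result is the stimulus-level Q for the two terminal
    stimuli, because no internal signals exist before any reinforcement.
    """
    idx = np.asarray(terminal)
    return q_stim[np.ix_(idx, idx)]

# Terminal stimuli of the eight sequences (assumed ordering, see lead-in).
TERMINAL = [0, 1, 0, 1, 2, 3, 2, 3]
Q0 = initial_q(Q_STIM, TERMINAL)
```

With this ordering the first row of `Q0` reproduces the first printed row of the 8 by 8 matrix, and the same lookup applies to any other finite-sequence environment built from these four stimuli.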
For the terminal matrix (again with $N_a \eta/\delta = 160$) we now have

$Q(\infty)$ =

.104 .000 .104 .000 .104 .000 .104 .000
.000 .174 .000 .174 .000 .174 .000 .174
.104 .000 .174 .000 .174 .000 .174 .000
.000 .174 .000 .174 .000 .174 .000 .174
.104 .000 .174 .000 .174 .000 .174 .000
.000 .174 .000 .174 .000 .174 .000 .174
.104 .000 .174 .000 .174 .000 .174 .000
.000 .174 .000 .174 .000 .174 .000 .174

This corresponds to an oscillating condition, in which each A-unit (after giving
its original unaltered response to the first stimulus of the sequence) responds
either 1,0,1,0,1,0,1 or 0,1,0,1,0,1,0 to the remaining seven stimuli of
the sequence.


In contrast to previous models, there appears to be a failure to 
associate successive stimuli, and an association of every alternate stimulus 
instead. Actually, appearances are misleading here; a strong association of 
successive stimuli is masked by the appearance of these stimuli in the test 
sequence (which is identical, in this experiment, with the preconditioning 
sequence). In other words, the perceptron "predicts" the A-set for the next
stimulus at precisely the time that this stimulus actually appears, and conse- 
quently the effect of the prediction is not detected. The following experiment 


reveals these "hidden associations" in a striking fashion. 


EXPERIMENT 13: Using the same four stimuli as in Experiment 12, the
perceptron is shown the preconditioning sequence $S_1, S_2, S_3, S_4 /
S_1, S_2, S_3, S_4 / \ldots$. It is then tested with the sequence
$S_1, 0, 0, 0, \ldots$, and the Q-matrix for all subsequences (from
both preconditioning and test sequences) is obtained.
If this experiment is performed with $N_a \eta/\delta = 100$, and all
other parameters as before, it is found that on presenting the test sequence
$(S_1, 0, 0, \ldots)$ the perceptron recapitulates the identical sequence of
active sets $A_1, A_2, A_3, A_4$ which would have been activated had the
preconditioning sequence occurred in full. After $A_4$, the system lapses into
inactivity, since the preconditioning sequence is interrupted at this point.*


19.7 Analysis of Continuous Periodic Environments 


Up to this point, it has been assumed that the activity of the
perceptron is interrupted at least once every $m$ stimuli. We now turn to the
case of a continuous, unbroken sequence of stimuli, where the activity of the
association system is allowed to run on without interruption. To begin with,
the case of a periodic stimulus sequence will be considered, where the
preconditioning sequence takes the form:

$S_1 S_2 \ldots S_m\; S_1 S_2 \ldots S_m\; \ldots$

the period of the sequence being $m$. Such an environment can be considered
as being composed of a set of $m$ subsequences, each of length $m + 1$.
Specifically, we have the subsequences:

$\mathcal{S}_1 = (S_1 S_2 S_3 \ldots S_m S_1)$
$\mathcal{S}_2 = (S_2 S_3 \ldots S_m S_1 S_2)$
$\ldots$
$\mathcal{S}_m = (S_m S_1 S_2 S_3 \ldots S_m)$


* This "hallucinatory recall" effect, in which the perceptron, cued by the
initial stimulus of the sequence, reproduces the identical sequence of internal
states which would have been activated had the stimuli continued in their usual
order, is suggestive of some of Penfield's observations on hallucinatory recall
of stereotyped sequences induced by electrical stimulation of brain foci in
epileptics (Ref. 68).
Each sequence occurs with probability $1/m$, and each sequence begins and
ends with the same stimulus.
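The decomposition of a periodic environment into its $m$ subsequences of length $m + 1$ can be written down directly. The helper below is a small illustration with a hypothetical name; it takes one period of the stimulus sequence and returns the cyclic windows, each of which starts and finishes on the same stimulus.

```python
def periodic_subsequences(period):
    """Return the m subsequences of length m + 1 for a periodic environment.

    period is one full period of the stimulus sequence (length m). The k-th
    subsequence starts at the k-th stimulus of the period, runs once around
    the cycle, and ends on that same stimulus.
    """
    m = len(period)
    return [tuple(period[(k + j) % m] for j in range(m + 1)) for k in range(m)]

# A period of three stimuli gives three subsequences of length four.
subs = periodic_subsequences(["S1", "S2", "S3"])
```

For the eight-stimulus period used later in this section, the same call would yield eight subsequences of nine stimuli each, every one occurring with probability $1/8$.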


Now since the preconditioning sequence is assumed to extend
indefinitely into the past, at any arbitrary time $t$, the antecedent sequence
for the first and last stimulus of any $(m+1)$-subsequence is the same;
consequently $\gamma_i^{q'} = \gamma_i^{q-1}$ for all $i$. But this means that
there are, in fact, only a finite number ($m$) of $\gamma$'s for any A-unit,
$a_i$, so that the steady-state value of $\gamma_i$ can be computed exactly by
equation (19.14), where the sequence $\mathcal{S}_{q'}$ is interpreted to mean
the sequence $\mathcal{S}_{q-1}$ in the set of $m$ subsequences specified
above.


Several special cases are of particular interest. Consider first
the case of a steadily maintained stimulus, $(S_1 S_1 S_1 \ldots)$. Substituting
in (19.14), we have

$$\gamma_{i(n+1)}^1 = \frac{N_a \eta}{\delta}\, \phi(\beta_i^1 + \gamma_{i(n)}^1)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^1 + \gamma_{j(n)}^1)$$

and it is readily seen that the set of active units can never change from the
initial set, since this equation yields zero unless $\phi(\beta_i^1) \ne 0$ for
the first iteration. Thus for a steady stimulus, we have

$$Q_{11}(\infty) = Q_{11}(0)$$
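Next, consider the alternating sequence $S_1 S_2 S_1 S_2 \ldots$. In this
case, (19.14) takes the form

$$\gamma_{i(n+1)}^r = \frac{N_a \eta}{\delta} \Bigl[ \tfrac{1}{2}\, \phi(\beta_i^1 + \gamma_{i(n)}^1)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^2 + \gamma_{j(n)}^2)\, \phi(\beta_j^{r'} + \gamma_{j(n)}^{r'})
+ \tfrac{1}{2}\, \phi(\beta_i^2 + \gamma_{i(n)}^2)
\sum_{\beta_j} P(\beta_j)\, \phi(\beta_j^1 + \gamma_{j(n)}^1)\, \phi(\beta_j^{r'} + \gamma_{j(n)}^{r'}) \Bigr]$$

In this case, if either $\phi(\beta_i^1)$ or $\phi(\beta_i^2) = 1$,
$\gamma_i^r$ will generally be non-zero, and the system will tend to form a
union of the sets initially responding to $S_1$ and $S_2$ (provided
$Q_{12}(0) \ne 0$).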


Finally, consider the stimulus sequence of Experiment 12,
consisting of a period of alternation of $S_1$ and $S_2$ followed by an
alternation of $S_3$ and $S_4$, as described in Chapter 17. Rather than compute
the entire 20 by 20 Q-matrix for Experiment 12, we present here a "miniaturized
version" of this experiment, based upon the eight-stimulus sequence
employed in Example 2 of the preceding section. For the continuous
environment, the eight sequences will be:

$\mathcal{S}_1 = (S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4 S_1)$      $\mathcal{S}_5 = (S_3 S_4 S_3 S_4 S_1 S_2 S_1 S_2 S_3)$
$\mathcal{S}_2 = (S_2 S_1 S_2 S_3 S_4 S_3 S_4 S_1 S_2)$      $\mathcal{S}_6 = (S_4 S_3 S_4 S_1 S_2 S_1 S_2 S_3 S_4)$
$\mathcal{S}_3 = (S_1 S_2 S_3 S_4 S_3 S_4 S_1 S_2 S_1)$      $\mathcal{S}_7 = (S_3 S_4 S_1 S_2 S_1 S_2 S_3 S_4 S_3)$
$\mathcal{S}_4 = (S_2 S_3 S_4 S_3 S_4 S_1 S_2 S_1 S_2)$      $\mathcal{S}_8 = (S_4 S_1 S_2 S_1 S_2 S_3 S_4 S_3 S_4)$


It ia found that in this experiment, there is no choice of para- 
meters which will yield an increase in Qy2 » Qe > Q5, » and Qx7 without 
producing a corresponding increase in the set of A-units responding jointly to
all stimulus sequences. It can also be shown that no matter how far the
period of the preconditioning sequence is extended (by increasing the duration
of $S_1 S_2$ alternation and also increasing the duration of $S_3 S_4$
alternation) the system will never be able to selectively combine the sets
$(A_1, A_2)$ and $(A_3, A_4)$ as in previous models. There is, nonetheless, a
"predictive" effect which would be revealed if the stimuli were suddenly cut
off, as in Experiment 13.


From this example (and those of the preceding section) it is 
clear that the condition for selective merging of A-sets for temporally 
adjacent stimuli is not as easily satisfied as in the four-layer system, or 
open-loop systems with zero transmission time. Experiment 14, however, 


illustrates a simple modification of the preconditioning sequence by which 


such a merger can be obtained. 
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EXPERIMENT 14: The same four stimuli are employed as in Experi- 
ments 12 and 13. The preconditioning sequence, however, takes 
the form: S,S, 5, S,5,5,5, 5,5, 5,53 53 54 Sg5555 5454555, , 
repeated ad infinitum. The terminal Q-matrix is obtained as 


before, for the twenty possible sequences of duration 21. 


In this case, it is found that there will be a tendency for the
sets $A_1$ and $A_2$ to merge, and for the sets $A_3$ and $A_4$ to merge in a
separate "cell assembly".* What happens here is that the A-units responding
to $S_1$ tend to be associated to the two most common successors of $S_1$ in
the preconditioning sequence: namely, $S_1$ itself, and $S_2$. Similarly, $S_2$
is associated both to $S_1$ and $S_2$. Thus, when $S_1$ occurs at the start of
the sequence it tends to be followed (coincident with its second appearance)
by the combined set ($A_1$, $A_2$). When the first $S_2$ stimulus appears, $A_2$
combines with the "predicted" $A_1$ set, and the combined ($A_1$, $A_2$) set
tends to persist until the first occurrence of $S_3$, at which point it may
combine with the new $A_3$ set, or may become inactive, depending upon the
magnitude of $N_a \eta/\delta$. In order to prevent the original set from persisting
indefinitely (since each A-set tends to predict itself, on the following cycle)
$N_a \eta/\delta$ must be kept small enough so that the γ-components alone are
insufficient to activate A-units whose β-components are zero. In this
case, only part of the original A-sets will be activated in the absence of the
actual stimulus, but a bias will still remain in the direction of the desired
combination of A-sets.


* The term "cell assembly" seems appropriate here, as the sets which are
formed in the terminal state of a cross-coupled perceptron bear a close
resemblance in organization and functional properties to the cell assembly
concept proposed by Hebb, in Ref. 33.
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In general, if each stimulus which forms part of an "event" can 
occur with equal probability after any other stimulus in the same event, then 
all of the A-sets responding to these stimuli will tend to merge, at least in 
part, and will be evoked by any stimulus of the event-class. This is essen- 
tially the same effect which was found for four-layer perceptrons in 


Chapter 16. 


Actually, with the β-vectors corresponding to those in Table 9,
(for A-units with only three retinal connections) the system is not well
behaved in Experiment 14 regardless of the choice of threshold and $N_a \eta/\delta$.
With larger numbers of connections and the possibility of higher thresholds,
however, it seems likely that the desired effect could be obtained with the
preconditioning sequence given in the experiment. A γ-perceptron (or a
Γ-perceptron) would probably be somewhat better behaved in this
experiment, as it would tend to inhibit the sets of A-units characteristic of the
first "event" once the second event began. In the α-system, there is a
strong tendency for all A-sets to merge whenever $N_a \eta/\delta$ is sufficient to
permit the merger of the desired sets.


19.8 Analysis of Continuous Aperiodic Environments


If the preconditioning sequence is not periodic, some sort of 
approximation procedure must be used, if Equation (19.14) is to be applied. 
Two possibilities suggest themselves: First, the aperiodic sequence (if it 
is statistically uniform throughout) can be approximated by a periodic 
sequence if the period is sufficiently long to encompass all likely
juxtapositions and short subsequences of stimuli. Second, we can consider all
subsequences of length m, assigning a probability to each, and analyze the
system as though we were dealing with a finite-sequence environment, con- 
sisting of the various m -sequences in an appropriate frequency mixture. In 
this case, the analysis should converge to a correct solution as m becomes 
large, provided the original sequence is statistically uniform. If the statistical 
composition of the original preconditioning sequence changes over time, 
neither of these methods is applicable, and it seems likely that accurate
solutions can then be obtained only by actually simulating the system and 


observing its behavior empirically. 
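
The second approximation above can be illustrated with a short sketch. The Python below is an illustration only, under stated assumptions: the stimulus labels, the example sequence, and the function name `subsequence_mixture` are hypothetical and do not come from the text. It estimates the probability of each subsequence of length m by counting overlapping windows in one long, statistically uniform preconditioning sequence.

```python
from collections import Counter


def subsequence_mixture(sequence, m):
    """Relative frequency of every length-m window in a stimulus sequence."""
    # Slide a window of length m over the sequence (overlapping windows).
    windows = [tuple(sequence[i:i + m]) for i in range(len(sequence) - m + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {window: count / total for window, count in counts.items()}


# Hypothetical preconditioning sequence over four stimuli labelled 1..4.
sequence = [1, 2, 1, 2, 3, 4, 3, 4, 1, 2, 1, 2, 3, 4, 3, 4]
mixture = subsequence_mixture(sequence, m=2)
```

The resulting dictionary could serve as the frequency mixture of m-sequences for a finite-sequence analysis; larger m would be expected to give a closer approximation at the cost of many more distinct subsequences.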


In the experiments which are of primary concern at this time, it 
is always possible to assume a statistically uniform preconditioning sequence, 
so that one of the two methods described above can be applied. In practice, 
this problem is likely to be soluble only for relatively small numbers of 
stimuli in the environment, as the Q-matrices rapidly become too large to 
handle in currently available digital computers. For long stimulus sequences 
and large numbers of stimuli, digital simulation remains the preferred
technique, and this offers the additional advantage of being applicable to small
perceptrons or systems where the assumption of infinitesimal transmission
time is inadmissible. In the preceding examples, where theoretical values
(rather than empirical values) of $Q_{ij}$ were used, $N_a$ was implicitly taken
to be very large.


19.9 Cross-Coupled Perceptrons with Value-Conservation


The two types of value-conserving systems, γ-systems and
Γ-systems, which were considered in section 16.6, are also of interest
in cross-coupled systems. The Γ-system, which tends to strengthen
connections to the A-set responding to the most likely successor of the
present stimulus, while developing inhibitory connections to the A-units
responding to unlikely successors, appears to be the more promising of the
two. In most environments, however, both systems will probably show
similar phenomena, provided transitions between stimuli can occur
symmetrically in either direction. The analysis of the γ-system, which is
somewhat more familiar from previous work, will be considered first.


19.9.1 Analysis of γ-systems


In the γ-perceptron, the total value of the set of input connections
to each A-unit is conserved. Specifically, (assuming the system to be fully
coupled) the change in the value of connection $c_{ij}$ is given by

$$\Delta v_{ij} = a_j(t)\left[a_i(t-\tau) - \frac{1}{N_a}\sum_k a_k(t-\tau)\right]\eta\,\Delta t \qquad (19.15)$$


Instead of (19.19), this leads to the differential equation: 


dg _ Z 
BE = MZ Ry -g09,) 09.16 


- 840) +) d(t-tf) ad (te) 
& 
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Since Qe - Q Qa may be negative, the former proof of convergence 
again breaks down, since γ need not be monotonic. As in the case of the
four-layer system, the approach will be to try to obtain a time-dependent
solution for the γ's. The task is complicated in this case, however, by
the presence of the unknown quantities A” %, (¢) in the equation, which 


we have not hitherto had to evaluate. 


For the γ-system, any equilibrium equation must be of the form


= Net , : 
z a8 a [95 2, (19.17) 


anit ra se )ERE)h peep’) Ber) - IAA) 9!) 
_ Mx(A] 


~  § 


Where A= set of active A-unit sets, A; , for which the value of 7,'(o) 


eee 


is computed. As long as all φ's remain fixed, the γ's will tend
exponentially towards such an equilibrium condition, as in previous models.
Now consider the set of units whose d 's change value at time i. 

We wish to find the asymptotic value of the change in %, due to adding or 
subtracting this set of active units to the set A,, at time t. . This is 
equal to the difference between the asymptotic value of 7% based on the new 
set of active units A,. (t,) and the asymptotic value based on the old set 

of active units A_, (t,,) . Specifically, from (19.17), and with an 
obvious extension of previous notation, 
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a Aa", ) = iM eae -am ko Oe )) 
“SLA S ért,)( 2 °(4) eee a cies) 667.) 
2 i est )- 8 &; ec} x PU) bre” )) 


“304 HER) feres)-oeyee)]peres)-atc)) 9989 


With this equation for the asymptotic value of the "incremental
set" of A-units which become active (or inactive) at time t, it becomes
possible to compute the time-dependent solution in much the same manner as
for the four-layer perceptron. To begin with, we obtain the functions E 
defined in equation 16.23) forall & , and thus determine the next oc, 

for which D(x,) will change. This gives us the values of $<; ty) 
which are required in equation (19.18). We then compute the actual value 
of a % (ts ) as follows. The contribution, a 7, , being composed 
of a number of individual values, wi , will approach its asymptotic value 
exponentially, with the same time-constant as the 7 's. Thus, if we can 
determine the value of the set of contributing connections at the start of the 
interval (time re ! ) we can determine its value at time t, . Now the 
value at to, is simply the sum of the anlt,.,) for all 4 such that blas” ) 
changes at ty . We will use the notation A, 7. (t,) for this starting 
value. Specifically, 


04916) = Ne DPA left) - 6606" )] ae 2) 


* To avoid computing wilt) » an approximation is required, e.g., 


(19.19f" 
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Then, by analogy to (16.24), we have 


@ PFy,% . a [a r ( ) 
A'y'(tt) = Maral. ntti ( 73 “A, 7, ( “) 
(19.20) 


6 


Thus, the complete solution for A at time t,” (including the discontinuity 


at the terminal end of the interval) is given by: 


rey MEO J+ MA" Z" AS) 
7%; (ty ) 7 . 5 - 


= Hi ty.,)-boritta)) (09-21) 


The value of the discontinuity time, $t_n$, is obtained as before, from


_ alee Mbp. ai. MLA AS.) 


equation (16.25). 


This completes the analysis of the cross-coupled γ-system.
While no cases have actually been computed at the present time, it seems
likely that this system will generally be better behaved than the α-system,
particularly in such problems as Experiment 14, where there is a tendency for
all A-sets to merge under α-system dynamics.


19.9.2 Analysis of Γ-systems


In the Γ-system, where the value is conserved over the set of
output connections from each A-unit, the change in the value of the connection
$c_{ij}$ is now

$$\Delta v_{ij} = a_i(t-\tau)\left[a_j(t) - \frac{1}{N_a}\sum_k a_k(t)\right]\eta\,\Delta t \qquad (19.22)$$
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This leads to the differential equation and equilibrium equation, respectively, 


dy" _ Fig) _ = r a Ae Ad ae 
ae ne D5 (t) Q(t) |}, 0%, t+ 2 ate t)°% 2) (19.23) 


N 
7," (@) = = 2 A He) : a] Qa (19.24) 


From these equations, a solution for $\gamma(t_n)$ can clearly be computed
along the same lines as in the previous section, for the γ-system.
Specifically, the asymptotic value for the connections from the difference set 


takes the form: 


Mp al Fe let) (2, 


DP) Oe) 9 Ger) -9 6; 2.) 


As 7," (t*) and /* % (ty) are computed by equations (19.19) and (19. 20) 


(19.25) 
without any modification, so that the final solution can be obtained as before
from Equation (19.21).

Due to its apparent superiority as a predictive system, and since
it appears to have the same advantages in stability of the A-set organization
as the γ-system, this model seems likely to be the most versatile system
analyzed thus far.
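
The difference between the two value-conserving rules can be made concrete with a small sketch. The code below is not the author's formulation: it assumes binary activity vectors, a fully coupled net (self-connections included), no decay term, and the hypothetical names `gamma_update` and `big_gamma_update`. It only demonstrates the conservation property stated in the text: the first rule leaves the total value of the input connections to each unit unchanged, and the second leaves the total value of the output connections from each unit unchanged.

```python
def gamma_update(v, a_prev, a_now, eta):
    """Assumed gamma-rule: conserves the summed inputs to each receiving unit j."""
    n = len(v)
    mean_prev = sum(a_prev) / n
    # v[i][j] is the value of the connection from unit i to unit j.
    return [[v[i][j] + eta * a_now[j] * (a_prev[i] - mean_prev)
             for j in range(n)] for i in range(n)]


def big_gamma_update(v, a_prev, a_now, eta):
    """Assumed Gamma-rule: conserves the summed outputs from each sending unit i."""
    n = len(v)
    mean_now = sum(a_now) / n
    return [[v[i][j] + eta * a_prev[i] * (a_now[j] - mean_now)
             for j in range(n)] for i in range(n)]


def column_sums(v):
    """Total value of the input connections to each unit."""
    return [sum(row[j] for row in v) for j in range(len(v))]


def row_sums(v):
    """Total value of the output connections from each unit."""
    return [sum(row) for row in v]


# Hypothetical example: units 0 and 1 were active, units 1 and 2 are active now.
v0 = [[0.0] * 4 for _ in range(4)]
a_prev, a_now = [1, 1, 0, 0], [0, 1, 1, 0]
v_gamma = gamma_update(v0, a_prev, a_now, eta=0.1)
v_big_gamma = big_gamma_update(v0, a_prev, a_now, eta=0.1)
```

In the second rule, a previously active unit gains value on its connections to the units that are active now and loses value on its connections to the others, which is the "predictive" bias toward likely successors described above.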

19.10 Similarity Generalization Experiments 
The consideration which first drew attention to the importance 


of cross-coupled perceptrons was the prediction by Rosenblatt (Ref. 85) that 


such networks would be capable of improving their performance in similarity 
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generalization, as a result of prolonged exposure to an environment in which 
stimuli are more likely to be succeeded by their transforms than by unrelated 
stimuli. In Chapter 16, it was shown that a suitably organized four-layer
perceptron has such a capability, and the above analysis shows that for 
sequences in which the activity of a cross-coupled perceptron is interrupted 
after every other stimulus, its performance should be equivalent to the four- 


layer model. Thus the original prediction appears to be upheld. 


The mathematical analysis of cross-coupled networks has been 
completed too recently to permit detailed examples of similarity generalization 
to be worked out at this time. A series of simulation experiments has been
completed, however, employing a program written by Trevor Barker for the
IBM 704. In this program a fully coupled network of 102 association units is
represented, with γ-system dynamics. The model differs from those
analyzed above, in that the values do not decay. This leads to "instability" of
the system (a tendency to go into terminal oscillatory modes with massive
A-unit activity, unrelated to the stimuli which are presented), unless some
additional measures are taken to limit the growth of the connection values. The
program was therefore modified for bounded values. In order to prevent the
tendency of the γ-system to turn off most of the initially responding A-units
after the first few preconditioning stimuli, a further modification was
included to permit half-integer values for θ. Thus the values of the
cross-coupling connections have no effect until the magnitude of γ is at
least equal to 1/2.
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Even in this modified program, performance is considerably 
poorer than might be expected of the decaying value models, since the system 
ultimately goes to a saturation condition, with all values either at the upper 
or lower bound. Prior to this saturation state, however, (and to a lesser 
degree even in its saturated condition) similarity generalization can be 


successfully demonstrated, as in the following experiments. 


Figure 54 shows the results of two experiments, with five
excitatory and five inhibitory retinal connections to each A-unit, θ = 1.5,
η = .005, and an upper bound of .2 for all values. In each case, the
preconditioning sequence consisted of random stimuli, alternating with their
transforms. The transform, T(S), consisted of a displacement of S
by half the width of the retina. The retina itself was a 4 by 36 mosaic
(144 points), and all stimuli covered one fourth of these points. In the first
experiment, the preconditioning stimuli consisted of random "salt and pepper
patterns", in which any combination of points is equally likely. In the second
experiment, the stimuli were constructed by a "blob generating program" which
produces coherent, but randomly shaped patterns such as those illustrated in
the figure. The test stimuli, in each case, consisted of the same set of ten
coherent patterns (rectangular designs). After being exposed to the
preconditioning sequence $S_1, T(S_1), S_2, T(S_2), \ldots, S_n, T(S_n), \ldots$, activity of
the A-system is interrupted, and a G-matrix is computed for the twenty
sequences:


J, 5; S, TwW,) = TS,)T@,) 
J, ae) Téf,)= T@,) T6,) 


ut 
Y) 
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TYPICAL COHERENT STIMULI 
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Figure 54 CROSS-COUPLED PERCEPTRON SIMULATION EXPERIMENTS 




This G-matrix indicates which of the ten transforms would be identified
correctly if the perceptron were trained to recognize their images, by means
of a single reinforcement. Sequences of duration 2 are used, to provide time
for impulses to propagate over the cross-connections before testing the
response.
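
The construction of the "salt and pepper" preconditioning sequence described for Figure 54 can be sketched as follows. This is a rough illustration, not a reconstruction of the IBM 704 program: the function names are hypothetical, and the sketch assumes that the displacement by half the retina width wraps around the edge of the 4 by 36 mosaic, which the text does not state.

```python
import random

ROWS, COLS = 4, 36                 # retina size given in the text (144 points)
ACTIVE_POINTS = ROWS * COLS // 4   # each stimulus covers one fourth of the points


def salt_and_pepper(rng):
    """Random stimulus in which any combination of 36 points is equally likely."""
    points = [(r, c) for r in range(ROWS) for c in range(COLS)]
    return frozenset(rng.sample(points, ACTIVE_POINTS))


def transform(stimulus):
    """T(S): displace S by half the retina width (wrap-around is an assumption)."""
    shift = COLS // 2
    return frozenset((r, (c + shift) % COLS) for r, c in stimulus)


def preconditioning_sequence(n_pairs, seed=0):
    """Sequence S_1, T(S_1), S_2, T(S_2), ... of random stimuli and transforms."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_pairs):
        stimulus = salt_and_pepper(rng)
        sequence.extend([stimulus, transform(stimulus)])
    return sequence
```

Under the wrap-around assumption the transform is its own inverse, so a stimulus and its transform alternate symmetrically, as required for the two value-conserving systems to behave similarly.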


The curves show the mean performance of ten perceptrons over
the set of ten test transforms, as a function of the number of preconditioning
stimuli. In the case of the coherent stimuli, note that learning is more
rapid, and saturation is reached more quickly, than with the random stimuli
(where the saturation condition has not been reached even after 5000
preconditioning stimuli). While the peak performance level is less than .60, a
statistical evaluation of the data reveals that the trend is definitely significant.
All ten perceptrons, individually, showed a trend in the expected direction,
so that the chance of obtaining these results accidentally would be less than
.001. It should be noted that since the expected generalization coefficient,
$g_{ij}$, from a stimulus to its disjoint transform is negative (in a γ-system)
these perceptrons had to overcome an initial negative bias before achieving
even the "chance" level of 50% correct identifications.
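
The figure of .001 is consistent with a simple sign-test argument, although the text does not say which statistical test was actually used. As a hedged check, assume each perceptron is equally likely to show a trend in either direction under the null hypothesis, and that the ten perceptrons are independent; the function name below is hypothetical.

```python
from fractions import Fraction


def all_same_direction_probability(n_perceptrons):
    """Chance that every perceptron trends in the expected direction by accident."""
    # One-sided sign test: each independent perceptron has probability 1/2.
    return Fraction(1, 2) ** n_perceptrons


p = all_same_direction_probability(10)
```

With ten perceptrons this gives 1/1024, which is just under .001.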


These experiments confirm the predicted tendency of cross- 
coupled perceptrons to generalize on the basis of similarity, in a suitably 
organized environment. They also indicate the advantage of coherent over 
random stimuli, which is more pronounced in larger retinas than that 
illustrated. Doubling the number of retinal points would virtually eliminate 
the trend which is found for random stimuli, while the coherent stimulus 
curve would be relatively unaffected. All of these results are consistent 
with the laws of similarity generalization which were tentatively proposed in 


Section 15.4. 
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Until further empirical studies are completed, the theoretical 
results obtained for cross-coupled systems should still be interpreted with 
caution. There is at present no knowledge of the variance in performance 
over perceptrons, and how this relates to the size of the system; nor can 
we estimate the effects of finite stimulus sequences, in which the assumption 
of an infinitesimal rate of reinforcement per stimulus is not fully justified. 
The equations of the preceding sections represent limiting behavior for large
values of $N_a$, very gradual memory modification, and very long training
sequences. The assumption of large $N_a$ can be obviated by writing the
equations with empirical β-vectors measured for a particular perceptron,
but in this case the results can be generalized only by means of an empirical
sampling procedure, with many such perceptrons. The given
equations will probably be found to yield correct qualitative results, but


considerable work is still required to test their quantitative accuracy. 


19.11 Comparison of Cross-Coupled and Multi-Layer Systems


In similarity generalization experiments, it has already been 
observed that there is a marked similarity between the performance of the 
four-layer perceptron of Chapter 16, the open-loop cross-coupled system 
of Chapter 17, and the closed-loop cross-coupled systems considered above. 
All of these systems are capable of learning to associate patterns which occur 
frequently in temporal succession, and abstracting the principle of simi- 
larity from a transformation sequence (in which stimuli alternate with their 
transforms). All of these systems will tend to work better with coherent 
patterns than with random point patterns. In all cases, the constant $N_a \eta/\delta$
determines the nature of the terminal G-matrix which is obtained, for a
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given environment. Actually, an exact equivalence is found between the 
performance of the fully cross-coupled system in finite-sequence environ- 
ments, with sequences of two stimuli, and the performance of the open-loop 
system of Chapter 17 with τ = 1. Suppose the system of Chapter 17 is
extended to include an infinite number of A-sets, each with identical
connections from the retina, and with variable connections to each unit in the
t-th A-set from each member of the (t-1)-th A-set (and allowing unit time delay
in transmission). It can then be shown that the states of the t-th A-set for
the first t stimuli in the sequence will correspond exactly to the states of

the equivalent fully cross-coupled model (having all S-A connections equivalent 
to those in the open-loop model). Thus, the fully cross-coupled model, 
considered through all time, is equivalent to the output of an infinitely extended 
open-loop model, of the type discussed in Chapter 17. 
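
The equivalence can be illustrated by "unrolling" a cross-coupled net in time. The sketch below is a simplification under loudly stated assumptions: the connection values are held fixed and shared between all layers (no learning), units are binary threshold elements that fire when their input reaches θ, and all names are hypothetical. It shows only that the state of the t-th layer of the extended open-loop model matches the state of the cross-coupled model after the t-th stimulus.

```python
def step(beta, prev_state, weights, theta):
    """One unit-delay update: retinal signal plus signal from the previous state."""
    n = len(beta)
    new_state = []
    for j in range(n):
        coupled = sum(weights[i][j] * prev_state[i] for i in range(n))
        new_state.append(1 if beta[j] + coupled >= theta else 0)
    return new_state


def run_cross_coupled(betas, weights, theta):
    """Single A-set whose own delayed output is fed back to it."""
    state = [0] * len(weights)
    history = []
    for beta in betas:
        state = step(beta, state, weights, theta)
        history.append(state)
    return history


def run_unrolled(betas, weights, theta):
    """Chain of A-sets: layer t sees stimulus t and the output of layer t-1."""
    layers = []
    for t, beta in enumerate(betas):
        below = layers[t - 1] if t > 0 else [0] * len(weights)
        layers.append(step(beta, below, weights, theta))
    return layers


# Hypothetical three-unit ring: activity is passed from unit 0 to 1 to 2.
weights = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
betas = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
recurrent = run_cross_coupled(betas, weights, theta=1)
unrolled = run_unrolled(betas, weights, theta=1)
```

In this toy case both runs give the same succession of states, with the single active unit moving around the ring after the stimulus is removed.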


While these similarities would lead us to expect basically 
similar behavior in most problems for these different types of systems, some 
noteworthy differences do exist between the cross-coupled system and multi- 
layer systems with finite numbers of layers. First of all, there is an inherent 
sequence-dependence in the cross-coupled model, which makes its present 
state a function of the recent succession of events, (i.e., stimuli) rather 
than just the last event to occur. This means that all cross-coupled 
systems have some capability for temporal pattern recognition, even without 
variation in the transmission times of the input connections. Secondly, the
cross-coupled systems are likely to reach their terminal condition more 
rapidly, and with initially accelerating rates of adaptation, since the differ- 
ential equation depends on changes both in the transmitting and receiving 
sets of A-units, while in the four-layer model, the differential equation 
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depends only on changes in the receiving set, the transmitting set being fixed 
for all time. The dependence on both receiving and transmitting sets makes
the cross-coupled system more subject to "instability" phenomena, and
probably tends to reduce the "dynamic range" of the system (as a function of
$N_a \eta/\delta$) in most cases. These phenomena have not yet been studied
sufficiently to present conclusive quantitative results at this time.


A more important difference than any of the above may be 
potentially present, although this remains in the realm of speculation at 
present. In a value-conserving cross-coupled perceptron, where there is 
the possibility of developing pronounced inhibitory interaction between A-sets, 
there is a tendency to develop ''cell assemblies'' (in Hebb's sense), and these 
cell-assemblies tend to rival one another for dominance at all times. It 
seems possible that such a phenomenon may provide a basis for figure- 
ground separation in complex sensory fields, where it is desired that the 
system attend to one object, or component of the input situation, and ignore 
the remainder. This will be discussed further in Part IV. If such an effect 
can be demonstrated, many of the remaining problems in the design of a 


perceiving system would be solved. 
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20. PERCEPTRONS WITH CROSS-COUPLED S AND R-SYSTEMS 


A number of interesting effects may be obtained by cross-coupling 
the S-units or R-units of a perceptron. Several such systems are considered 
briefly in this chapter. The first section deals with cross-coupled sensory 
systems; the second section deals with cross-coupled R-systems. Detailed 
analyses are not presented here, although several analytic studies are 


available in the referenced literature. 


20.1 Cross-coupled S-units


If the sensory units are arranged in a two-dimensional array,
or retina, then it has been proposed that inhibitory interconnections between
each S-unit and its nearest neighbors will tend to inhibit activity most
strongly in the center of a field of illumination, and less around the edges.
Such a system should lead to accentuated edges or boundaries for a visual
pattern, reducing the relatively redundant information coming from interior
regions. Systems utilizing this principle have been proposed by Taylor
(Ref. 99), by Inselberg, Löfgren, and von Foerster (Ref. 4), and by a
number of others. The Inselberg-Löfgren-von Foerster treatment includes
a more detailed quantitative analysis than was hitherto available, including
cases in which the probability of interconnection of two units is a Gaussian
function or an exponential function of the distance between them.
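
The edge-accentuating effect of nearest-neighbor inhibition can be seen in a very small sketch. The code below is not any of the cited models: it assumes a single feed-forward pass (each unit is inhibited by the raw illumination of its four nearest neighbors rather than by their outputs), an arbitrary inhibitory weight of 0.25, and a hypothetical function name.

```python
def lateral_inhibition(image, weight=0.25):
    """Subtract a fraction of the four nearest neighbors' illumination."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = 0.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    neighbours += image[rr][cc]
            # Outputs cannot go negative, so dark regions stay silent.
            out[r][c] = max(0.0, image[r][c] - weight * neighbours)
    return out


# Hypothetical stimulus: a 5 by 5 illuminated square on a 7 by 7 retina.
image = [[1.0 if 1 <= r <= 5 and 1 <= c <= 5 else 0.0 for c in range(7)]
         for r in range(7)]
edges = lateral_inhibition(image)
```

Interior points, which have four illuminated neighbors, are fully suppressed, while points on the sides and corners of the square keep part of their activity, leaving an outline of the pattern.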
While it appears that contour detectors can indeed be constructed 


by this means, it should be noted that some information is lost in the 
process: namely, the indication of the direction of the illumination gradient 


* See also Chapter 23, on visual analyzing mechanisms. 
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across the contour. Thus if a square patch of illumination is operated upon 
by the network to yield a square outline, there is no way to tell whether 

the inside of the square was light and the outside dark, or vice versa. The 
contour-detectors proposed by Rosenblatt in Ref. 79, which consist of
A-units with circular or elliptical distributions of origin points, with
slightly different centers for excitatory and inhibitory origin clusters, still
preserve this gradient information.*


A somewhat more interesting possibility has been demonstrated
by Inselberg, et al., in which three layers of units with anisotropic
connections are superimposed on one another, with a rotation of the axes of
symmetry by 60° in the successive layers. With such a system, it appears to be
possible to construct a network from which there is zero output from a
straight-line stimulus (regardless of its orientation) but a non-zero output
from a curved line. Such systems clearly deserve more study as possible
stimulus analyzing mechanisms for reducing the input data to a perceptron.


Systems with excitatory interconnections between S-units are of 
relatively little interest, as such a network would generally lead only to a 
spread of activity from the stimulus region. The only useful function which 
such connections might have would be in smoothing irregular or broken 
images, by filling in holes and gaps; such an application, however, seems 


to be of questionable utility at the present time. 
20.2 Cross-coupled R-units 

Inhibitory interconnections between R-units may be useful in 
several ways. One application is to guarantee that no more than one R-unit 


can be "on" at any time. For this purpose, all R-units are given inhibitory


* See also Hubel, Ref. 113, for relevant biological evidence. 
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interconnections to all the others; whichever unit first goes on, inhibits all 
the others, holding them off. Such a system will tend to "hang up" in this 
state, until the positive signal to the first R-unit is reversed, permitting 
some other unit to come on. If the speed of response of an R-unit is 
proportional to the magnitude of its input signal, such a scheme can be used 


to select the R-unit with maximum input from a given stimulus. 
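
The selection of the R-unit with maximum input can be sketched as a race. The latency model here is an assumption made only for illustration (response time taken as the reciprocal of a positive input signal), and the function name is hypothetical; the text says only that speed of response is proportional to the magnitude of the input.

```python
def first_to_fire(inputs):
    """Index of the R-unit that comes on first, or None if no input is positive."""
    # Assumed latency: a unit with positive input x responds after time 1 / x.
    latencies = {i: 1.0 / x for i, x in enumerate(inputs) if x > 0}
    if not latencies:
        return None
    # The earliest unit inhibits all others, so it alone remains on.
    return min(latencies, key=latencies.get)


winner = first_to_fire([0.2, 0.9, 0.5])
```

Because the fastest unit is the one with the largest input, the mutual inhibition leaves exactly that unit on, which is the maximum-selection behavior described above.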


In R-controlled reinforcement systems, inhibitory connections 
between R-units may sometimes be employed to guarantee that a unique 
response is associated to each new stimulus in succession. Suppose there 
are four stimuli, which activate disjoint or nearly disjoint sets of A-units. 


Let there be four R-units, with inhibitory connections as follows: 


$$R_1 \longrightarrow R_2 \longrightarrow R_3 \longrightarrow R_4$$


In this scheme, unit $R_i$ inhibits (absolutely) all successive R-units
($R_{i+1}$, $R_{i+2}$, ...). Now if stimulus $S_1$ occurs, and transmits an
initially positive signal to all R-units, only $R_1$ can go on. With an
R-controlled value-conserving system (in which the sum of values over all
connections is held constant) $S_1$ will then develop an excitatory signal to
$R_1$, and negative signals to all other R-units. At the same time (since
we have assumed essentially disjoint A-sets) the value-conserving system will
guarantee that the $R_1$ response generalizes negatively to all other stimuli.
Thus, when $S_2$ occurs, it will tend to turn off $R_1$, but will try to turn
on $R_2$, $R_3$ and $R_4$. Of these, only $R_2$ can remain on, due to the
inhibitory coupling, so that $S_2$ (or whichever stimulus occurs second in the
sequence) will become associated to $R_2$. Similarly, $S_3$ is associated to
$R_3$, and $S_4$ to $R_4$.
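
The assignment process can be followed step by step in a schematic simulation. The sketch below is a caricature rather than a perceptron model: it tracks only the net signal from each stimulus to each R-unit, uses arbitrary step sizes, and replaces the value-conserving reinforcement rule by its two described effects (the responding stimulus is strengthened toward the winning unit and weakened toward the others, and the winning response generalizes negatively to all other stimuli). All names are hypothetical.

```python
def assign_responses(stimulus_order, n_units, step=1.0, initial=0.1):
    """Map each stimulus to the R-unit it becomes associated with."""
    # signal[s][r]: net signal from stimulus s to R-unit r, initially positive.
    signal = {s: [initial] * n_units for s in stimulus_order}
    assignment = {}
    for s in stimulus_order:
        # Each unit inhibits all later ones, so the lowest-indexed unit
        # receiving a positive signal is the only one that can stay on.
        winner = next((r for r in range(n_units) if signal[s][r] > 0), None)
        assignment[s] = winner
        if winner is None:
            continue
        # Stand-in for reinforcement: excitatory toward the winner,
        # negative toward every other R-unit.
        for r in range(n_units):
            signal[s][r] += step if r == winner else -step
        # Stand-in for negative generalization to all other stimuli.
        for other in stimulus_order:
            if other != s:
                signal[other][winner] -= step
    return assignment


in_order = assign_responses(["S1", "S2", "S3", "S4"], n_units=4)
shuffled = assign_responses(["S3", "S1", "S4", "S2"], n_units=4)
```

Unit indices are zero-based in the sketch, so index 0 corresponds to the first R-unit. In both runs the responses are handed out in order of first appearance, so whichever stimulus occurs second is associated with the second R-unit.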
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This scheme becomes somewhat less trivial if it is applied to 
the four-layer perceptrons of Chapter 16, subsequent to a preconditioning 
sequence in which the perceptron has learned to associate a unique A-set 
to each similarity class of stimuli in a given environment. The above
method can then be employed to assign a unique response to each class of
stimuli (provided the terminal A-sets have sufficiently small intersections).


While the interconnection schemes proposed here for S- and
R-units are occasionally useful for control purposes, they do not introduce
any fundamentally new properties of importance. The most striking
phenomena to be found in cross-coupled systems are the similarity generalizing
capabilities of the cross-coupled association systems -- with the tantalizing
possibility of a figure-ground mechanism still to be investigated in future 


work. 
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PART IV 


BACK-COUPLED PERCEPTRONS AND PROBLEMS 
FOR FUTURE STUDY 
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21. BACK COUPLED PERCEPTRONS AND SELECTIVE ATTENTION 


In Parts II and III of this volume, we have tried to establish the 
fundamental properties of two topological classes of perceptrons:
series-coupled and cross-coupled systems. While the possible configurations of
these two types of perceptrons have by no means been exhausted, the most 
general forms of series-coupled and cross-coupled networks appear to be 
sufficiently well understood so that their principles can now be applied to the 
analysis of more elaborate systems. The most general network is achieved 
with the addition of back-coupling (Definition 26, Chapter 4), so that layers 
of units which are relatively remote from the sensory end of the perceptron 
can modify the activity of layers which are relatively close to the sensory 
end. Given this additional mode of coupling, then virtually all perceptrons 
of interest, however elaborate their structure, can be regarded as compounds 


or modifications of the types previously considered. 


The modulating effect of back-coupling upon the behavior of a 
perceptron will be considered qualitatively in this chapter. It will be seen 
that while the analysis of such systems can frequently be carried out in terms 
of already established principles, their behavior possesses a new order of 
sophistication. In particular, the psychological phenomena of selective 
attention and ''cognitive set'' now begin to emerge. A related exposition of 


these ideas can be found in Rosenblatt, Ref. 79, Chapter X. 
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21.1 Three-Layer Systems With Fixed R-A Connections 


21.1.1 Single Modality Input Systems 


The first case to be considered is the class of three-layer
perceptrons having fixed-value connections from the R-units back to the
A-units. For simplicity, it is assumed that there is no cross-coupling
within any of the three layers. Such a perceptron with two R-units can be 


represented by the symbolic diagram: 


where solid arrows represent fixed-value connections, and broken lines 
represent variable-valued connections. In particular, assume that there is 

a connection from every R-unit back to every A-unit, half of these connections,
chosen at random, having the value +1, and the other half having the value -1.
In the following section it will be assumed that the R-units are of an "on-off"
variety (having the outputs 1 or 0, rather than +1 and -1) although analogous
effects can be found for simple R-units. It is also assumed, for the sake of
avoiding impossible closed-loop situations, that all connections have a short
time delay, τ; a stimulus, however, is generally assumed to be held on
the retina for a time $T \gg \tau$.
The signal $c^*_{ri}$ which is fed back to an A-unit $a_i$ from the
response unit $r$ is given by the linear function

$$c^*_{ri} = r^* v_{ri}$$

Thus $c^*_{ri}$ is equal either to $v_{ri}$ or 0, depending on whether $r^* = 1$ or
0. The effect of these feedback signals on the set of A-units responding to
a given stimulus is shown in Figure 55. The symbol $\beta_i$ is used to represent
the component of the input signal, $\alpha_i$, which comes to the A-unit from the
retina. It is assumed that there are two R-units, so that there are four
disjoint sets of A-units with roughly $N_a/4$ units in each set, corresponding to
the four possible combinations of $v_{r_1 i}$ and $v_{r_2 i}$. These sets of
A-units are represented by the four quadrants of the diagram. The circles
indicate the values of $\beta_i$ received from the given stimulus, in relation
to the threshold, $\theta$. The A-units in the innermost circle, for which
$\beta_i \geq \theta + 2$, will always be on when the given stimulus occurs, regardless
of the condition of the R-units. Those units for which $\theta \leq \beta_i < \theta + 2$ will
be on except when they receive an inhibitory signal from both R-units
simultaneously. The units for which $\beta_i = \theta - 1$ must receive a net excitatory
signal from one or both of the R-units in order to go on, and those units for
which $\beta_i = \theta - 2$ will only go on (in the presence of the given stimulus) if
they receive an excitatory feedback signal from both R-units at once. Units
for which $\beta_i < \theta - 2$ will never respond to this stimulus. The magnitudes
of these sets can be calculated from tables of Q-functions (cf. Chapter 6
and Reference 87). The shaded area in Figure 55 shows the sets which
respond to the given stimulus when $(r_1^*, r_2^*) = (1, 1)$.
[Figure 55: four-quadrant diagram, one quadrant for each combination of the
feedback values $v_{r_1 i}$ and $v_{r_2 i}$ (each -1 or +1). Net feedback to the
A-units of each quadrant, by response state $(r_1^*, r_2^*)$:

  $v_{r_1 i} = +1$, $v_{r_2 i} = +1$:  +2 from (1,1), +1 from (1,0), +1 from (0,1), 0 from (0,0)
  $v_{r_1 i} = +1$, $v_{r_2 i} = -1$:   0 from (1,1), +1 from (1,0), -1 from (0,1), 0 from (0,0)
  $v_{r_1 i} = -1$, $v_{r_2 i} = -1$:  -2 from (1,1), -1 from (1,0), -1 from (0,1), 0 from (0,0)
  $v_{r_1 i} = -1$, $v_{r_2 i} = +1$:   0 from (1,1), -1 from (1,0), +1 from (0,1), 0 from (0,0)]

Figure 55 EFFECT OF FEEDBACK ON ACTIVITY OF A-SET, IN RESPONSE TO A GIVEN
STIMULUS, FOR PERCEPTRON WITH 2 R-UNITS. SHADING SHOWS ACTIVE
A-SETS FOR THE RESPONSE STATE $(r_1^*, r_2^*)$ = (1,1).
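The net feedback values listed for the four quadrants of Figure 55 follow from summing the feedback signals over the R-units. The following is a minimal Python sketch, assuming integer-valued signals and an activation condition of the form "retinal component plus net feedback reaches threshold"; the function names are illustrative and do not come from the text.

```python
from itertools import product

def net_feedback(v, r):
    """Sum of the feedback signals r_k * v_k reaching one A-unit.

    v: feedback connection values (+1 or -1), one per R-unit
    r: on-off R-unit outputs (1 or 0), one per R-unit
    """
    return sum(vk * rk for vk, rk in zip(v, r))

def is_active(beta, theta, v, r):
    """Assumed activation rule: retinal component plus feedback >= threshold."""
    return beta + net_feedback(v, r) >= theta

# Tabulate the net feedback for each quadrant and each R-state (two R-units).
for v in product((+1, -1), repeat=2):
    row = {r: net_feedback(v, r) for r in product((1, 0), repeat=2)}
    print(v, row)
```

Under these assumptions a unit with retinal component at least two above threshold stays on for every R-state, while a unit two below threshold goes on only when both R-units send it excitatory feedback.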
Now suppose there are two stimuli, $S_1$ and $S_2$. $S_1$ is
trained to give the response combination $(r_1^*, r_2^*) = (1,0)$, while $S_2$ is
associated to the response code (0,1). We assume that the retinal sets
representing the two stimuli are completely disjoint. Having trained the
perceptron, let us now present both stimuli simultaneously (i.e., a composite
image, $S_1 \cup S_2$, is projected on the retina). Under these conditions, a
series-coupled perceptron might equally well give the response combinations
(0,0), (0,1), (1,0) or (1,1). The present system, however, will
tend to respond either with (1,0) or with (0,1). In other words, it
will tend to correlate those R-states which go with one of the two stimuli,
rather than giving a partial response to each. This can be understood by
reference to Figure 56, where the A-sets responding to each of the two
stimuli are shown. For convenience, the sets responding to $S_1$ are
assumed to be disjoint from the sets responding to $S_2$, and the diagram
is simplified by assuming that the set which is active for the composite $S_1 S_2$
stimulus (in the presence of a given R-state) is equal to the union of the sets
responding to $S_1$ and $S_2$ alone. This last assumption is not generally
warranted, but the qualitative conclusions reached will still be correct. The
shading shows the reinforced sets for $S_1$ and $S_2$.


At the moment that $S_1 \cup S_2$ appears on the retina, both R-units
will be off, so that there is zero feedback to the A-system, and the total
signal coming to each R-unit from the A-system will be approximately zero
(consisting of a positive signal from one stimulus, and an approximately
equal negative signal from the other stimulus). Suppose, initially, both
R-units go on. In this case, the sets of A-units responding when $R^* = (1,1)$
will become active, and the total signal to each R-unit will still be
approximately zero, so that the response state is unstable. Alternatively, suppose
the R-state goes to (1,0). In this case, the signal to the R-units comes
from the double-hatched regions of the Venn diagram in Figure 56, and the
$S_1$ set becomes "dominant". If this occurs, the response (1,0) will
tend to remain stable, and may even persist after the stimuli are removed
(provided some of the A-units have thresholds $\leq 1$). Similarly, if the
R-state goes to (0,1), then the $S_2$ set becomes dominant, and its
response will tend to persist.

Figure 56 A-SETS RESPONDING TO THE STIMULI $S_1$ AND $S_2$, FOR THREE RESPONSE
CONDITIONS. SHADED AREAS SHOW REINFORCED SETS, AND DOUBLE
HATCHING SHOWS REINFORCEMENT WHICH GENERALIZES TO THE
CONDITION $R^*$ = (1,0).


If either stimulus has been trained to give the response (0,0)
in the above experiment, the R-units will tend to "hang up" in their initial
condition, and no other response can ever occur to the joint stimulus $S_1 S_2$.
On the other hand, it is possible to produce an oscillating or cyclical response
by training a given stimulus to give the response (1,1) when the present
response is (0,0), then conditioning the (1,1) set to give the response
(1,0), conditioning this set to give (0,1), and finally associating the
response (0,0) to the A-set responding for (0,1). In this case, as
long as the stimulus is held on the retina, the R-units will cycle through the
four responses in succession.
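The cyclical response described above amounts to a trained transition table over R-states. The sketch below only illustrates that table; it does not simulate the A-sets or the reinforcement that would produce it, and the names `TRANSITIONS` and `run_cycle` are hypothetical.

```python
# Trained transitions while the stimulus is held on the retina:
# present R-state -> R-state evoked by the A-set active under that state.
TRANSITIONS = {
    (0, 0): (1, 1),
    (1, 1): (1, 0),
    (1, 0): (0, 1),
    (0, 1): (0, 0),
}

def run_cycle(start, steps):
    """Return the R-states visited, one transition per feedback delay."""
    states = [start]
    for _ in range(steps):
        states.append(TRANSITIONS[states[-1]])
    return states

print(run_cycle((0, 0), 4))
```

Each step corresponds to one pass around the feedback loop, so the period of the cycle in real time would be four transmission delays under this reading.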
The important tendency which has been demonstrated for this
system is a tendency to correlate the output of the R-units so that they
all apply to a single stimulus, when a composite stimulus occurs at the
retina. This now provides the basis for the following experiment:
EXPERIMENT 15: Using a four-R-unit perceptron, and a universe of
squares and triangles of equal area in all positions on the retina,
train the system to give the responses $(r_1^*, r_2^*) = (1,0)$ for a
triangle, and (0,1) for a square; $(r_3^*, r_4^*) = (1,0)$ for a
stimulus in the top half of the retina, and (0,1) for a stimulus
in the bottom half. After training with an error-correction
procedure, test the response of the perceptron to the stimuli
$S_x$ = triangle in the top half of the field and square in the bottom
half, and $S_y$ = square in the top half with triangle in the bottom
half.


In this experiment, the first pair of responses are used for square/
triangle discrimination, and the second pair for top/bottom discrimination.
For the time being, assume that the error correction procedure is modified
by forcing the correct $R^*$ condition whenever a correction is applied. (This
assumption will be dropped in Section 21.2.) It is predicted that a back-coupled
system, organized as above, will tend to give one of the two responses
(1,0,1,0) or (0,1,0,1) for stimulus $S_x$ (signifying "triangle, top"
or "square, bottom", respectively), but will give one of the two responses
(1,0,0,1) or (0,1,1,0) for stimulus $S_y$ (signifying "square, top" or
"triangle, bottom"). In other words, the system should give a consistent
description of one of the two stimuli, in terms of shape and location, and
ignore the other stimulus; it will not name the shape of one and the position
of the other, even though both shapes and both positions are simultaneously
present.
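The predicted outcomes of Experiment 15 can be enumerated directly from the response code. This is a small sketch of the prediction only (it does not model the network); the dictionaries and the function name are illustrative assumptions.

```python
# Response code of Experiment 15: first pair = shape, second pair = position.
SHAPE_CODE = {"triangle": (1, 0), "square": (0, 1)}
POSITION_CODE = {"top": (1, 0), "bottom": (0, 1)}

def consistent_responses(objects):
    """Four-unit responses that describe exactly one object in the field.

    objects: iterable of (shape, position) pairs present on the retina.
    """
    return {SHAPE_CODE[shape] + POSITION_CODE[pos] for shape, pos in objects}

top_triangle_bottom_square = [("triangle", "top"), ("square", "bottom")]
top_square_bottom_triangle = [("square", "top"), ("triangle", "bottom")]

print(consistent_responses(top_triangle_bottom_square))
print(consistent_responses(top_square_bottom_triangle))
```

A mixed response such as (1,0,0,1) for the first composite would name the shape of one object and the position of the other, which is the outcome the back-coupled system is predicted to avoid.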
That the predicted effects will tend to occur can be seen by
referring to Figure 57, where it is assumed that the $S_x$ combination
(top triangle and bottom square) occurs. Reinforcement is shown by
cross-hatching. The relative sizes of the intersections in the Venn diagram are
drawn to suggest the relative intersections of the A-sets for the response
states of interest. Note that the set responding when $R^* = (1,0,0,0)$ tends
to have a relatively large intersection with the (1,0,1,0) set, due to the
fact that three of the four R-units are in identical states. The combined
intersection of the (1,0,0,0) set with the sets which are reinforced to
yield the "top" response (1,0) on $r_3$ and $r_4$ is greater than the combined
intersection with the sets which were reinforced for the "bottom" response.
If the triangle first becomes dominant with respect to the $r_1$, $r_2$ pair of
responses (yielding the condition 1,0,0,0) the activated set which has
been most heavily reinforced, shown by cross-hatching, will now tend to
evoke the "top" response from $r_3$ and $r_4$, since the "top triangle" set now
carries considerably greater weight than the "bottom square" set. Thus a
consistent configuration on all four R-units is induced. If (0,1,0,0) should
occur, however, the system will have an opposite bias for $r_3$ and $r_4$, tending
to evoke the condition (0,1,0,1). If $S_y$ should occur instead of $S_x$, the
biases will be found to favor the (1,0,0,1) or (0,1,1,0) conditions, as
predicted.


Experiment 15 illustrates the simplest conditions under which
"selective attention" might be said to occur in a perceptron. In a complex
field, with more than one trained stimulus present, rather than giving a
conflicting mixture of responses, the perceptron tends to pick a single
familiar "object" and respond to this object to the exclusion of everything
else. By adding additional responses, a complete description might be
obtained of the shape, size, position, etc., of a single object in the field.
The particular object which is selected, however, depends on chance factors,
such as the relative amounts of reinforcement which have been applied to
different A-sets, or momentary noise within the network. In the following
section, it will be shown how a stimulus in a different modality, such as a
spoken word, can be made to direct the attention of the perceptron towards a
selected object or region in the visual field.

[Figure 57: Venn diagram, not reproduced; the sets are labeled TOP TRIANGLE,
TOP SQUARE, BOT. SQUARE, and BOT. TRIANGLE.]

Figure 57 SETS AFFECTING THE TRANSITION FROM THE RESPONSE STATE (1,0,0,0)
WHEN THE COMBINED STIMULUS "TOP TRIANGLE" AND "BOTTOM SQUARE" OCCURS.
SHADING SHOWS REINFORCED SETS, AND THE MEASURES OF THE INTERSECTIONS
WITH THE (1,0,0,0) SETS ARE DENOTED BY THE LETTERS a, b, c, AND d.
THE VENN DIAGRAM IS DRAWN SO AS TO EMPHASIZE THE PROBABLE MAGNITUDES
INVOLVED.


21.1.2 Dual Modality Input Systems 


The perceptron which is illustrated in Figure 58 is similar
to the one which was described in the preceding section, except that it
possesses two sensory input systems, one visual (a retina) and the other
auditory (e.g., a filter system). There is a set of A-units for each of these
input sets, designated $A_v$ for the visual association system, and $A_a$ for
the auditory association system. Again, there are four R-units, each one
receiving variable-valued connections from all A-units in both sets, and
sending a set of fixed-value connections back to all the A-units. As before,
half of the feedback connections from each R-unit are assumed to be excitatory,
and the remainder inhibitory, with values $\pm 1$.
Figure 58 ORGANIZATION OF A DUAL MODALITY PERCEPTRON, WITH 4 R-UNITS
(BROKEN LINES INDICATE VARIABLE-VALUED CONNECTIONS)


With this system, the following experiment can be performed: 


EXPERIMENT 16: Using a dual-modality input system (visual and
auditory), with four R-units, train the perceptron to distinguish
square/triangle and top/bottom, using the same code and
stimuli as in Experiment 15. Then, selecting four discriminable
audio-patterns, SQ, TR, T, and B, train the perceptron by
means of the audio-input to associate the responses for "square",
"triangle", "top" and "bottom" to these four stimuli. In testing
the perceptron, a composite visual stimulus, consisting of a
triangle in the top half of the field and a square in the bottom
half, is used. Simultaneously with the visual input, the
audio-pattern SQ, TR, T, or B is presented, and the response of the
perceptron is observed for each of these four conditions.


From the discussion of Experiment 15, it is clear that the
visual section of the perceptron will tend to give a consistent response of
(1,0,1,0) or (0,1,0,1), representing "top triangle" or "bottom square",
respectively. The effect of adding the audio-stimuli is to add an additional
bias to the R-units, favoring one of the four "concepts", square, triangle, top,
or bottom. For example, if the TR stimulus is applied (which has been
independently associated to the composite response $(r_1^*, r_2^*) = (1,0)$) there
will be an auxiliary positive signal to $r_1$, and an inhibitory signal to $r_2$,
coming from the $A_a$ set. There will be no bias introduced on $r_3$ and $r_4$.
Consequently, the system will be biased to give the initial response
(1,0,0,0), which we have seen tends to transform itself into the stable
condition (1,0,1,0) for the given stimulus.


Thus the results which are predicted for Experiment 16 are that
when the audio-pattern TR is given, the perceptron will give the composite
response indicating the shape and position of the triangle; when SQ is
presented, the perceptron will indicate the shape and position of the square;
for the audio-input T, it will indicate the shape and location of the top
visual pattern; and for B, it will indicate the shape and location of the
bottom pattern. An audio-command can therefore be used to direct the
attention of the visual system to a specified location or a specified shape,
and the output of the perceptron will be a consistent description of the
indicated object.
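The predictions for Experiment 16 can be summarized as a lookup from audio cue to the full description of the object it singles out. The sketch below encodes only the predicted input-output behavior, not the mechanism; the cue table and function name are assumptions made for illustration.

```python
SHAPE_CODE = {"triangle": (1, 0), "square": (0, 1)}
POSITION_CODE = {"top": (1, 0), "bottom": (0, 1)}

# Each audio-pattern names either a shape or a position.
AUDIO_CUES = {
    "TR": ("shape", "triangle"),
    "SQ": ("shape", "square"),
    "T": ("position", "top"),
    "B": ("position", "bottom"),
}

def attended_response(objects, cue):
    """Predicted four-unit response when `cue` accompanies the visual field.

    objects: list of (shape, position) pairs in the composite stimulus.
    """
    kind, value = AUDIO_CUES[cue]
    index = 0 if kind == "shape" else 1
    matches = [obj for obj in objects if obj[index] == value]
    if len(matches) != 1:
        raise ValueError("cue does not single out exactly one object")
    shape, position = matches[0]
    return SHAPE_CODE[shape] + POSITION_CODE[position]

composite = [("triangle", "top"), ("square", "bottom")]
for cue in ("TR", "SQ", "T", "B"):
    print(cue, attended_response(composite, cue))
```

For the test stimulus of Experiment 16, the cues TR and T select the same object, as do SQ and B, so only two distinct responses are expected across the four conditions.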


While it is possible by means of the above procedure to assign
"names" to visual objects or events, and direct the attention of the perceptron
by means of these names, it should be noted that the association is actually
much too complete for this to serve as a model for linguistic "naming behavior".
For the perceptron, there is no difference (at the response level) between the
name for an object and the object itself. Thus the audio-symbol TR and the
visual image of a triangle both turn on the same response combination
(1,0,..) in the experiment considered above. If it is desired to retrain
the system to associate some other visual pattern (say, "trapezoid") with
the TR symbol, it is necessary to completely eliminate the previous
association of triangles to (1,0,..) and train trapezoids to give this response
instead. Words and visual patterns are part of the same conceptual class, for
this perceptron, and cannot be re-associated as distinct entities, but can only
be used as raw material for building up new conceptual classes. The distinction
between the name and the visual object becomes important in practice if we
wish to tell the perceptron to "look for the square" when there is no visual
square present. The audio-symbol "look" might be used to start an
automatic scan or hunting process, but to stop the process when a square is
found, the perceptron must be capable of distinguishing between the
audio-symbol for "square" (which it must remember for the duration of the search
process to tell it what it is looking for) and the visual pattern of a "square",
which must stop the search when it appears. A perceptron which is capable
of distinguishing between symbols and objects, and is not subject to these
criticisms, will be considered in Section 21.3.
21.2 Three-Layer Systems With Variable R-A Connections


In the previous examples, the existence of a bias towards one
of the two consistent response configurations when part of the $R^*$ state is
achieved, is due to the fact that reinforcement is applied only in the presence
of the correct response. This means that whenever a corrective
reinforcement is applied, the reinforcement control system must first "force" the
desired response configuration. But in a simple error-correction procedure,
as this concept has been used previously, the corrective reinforcement would
normally be applied only when the response is wrong, and this would tend to
reduce the indicated bias quite drastically. For example, in Figure 56, it
can be seen that if $S_2$ had been negatively reinforced in the presence of the
$R^* = (1,0)$ state, this negative reinforcement would tend to cancel the effect
of the $S_1$ signal. One method of eliminating this problem, which leads to a
system which appears to be generally better-behaved (on the basis of a
qualitative examination of its properties) is to make use of adaptive back-connections,
rather than fixed-value connections, from the R- to A-units.


21.2.1 Fixed Threshold Systems 


The first model to be considered corresponds topologically to the
model treated in Section 21.1.1, but differs in having variable connections,
so that its symbolic diagram is of the form:

[Symbolic diagram not reproduced.]

The forward connections, from A- to R-units, are assumed to follow the
usual $\alpha$-system dynamics, subject to error-correction procedures. The
back-connections, however, are subject to the $\gamma$-system rule which was
introduced for cross-coupled perceptrons. This means that the total value
of the set of feedback connections from each R-unit remains constant, but
that if both termini (the R-unit and the A-unit) are active in succession, the
connection value is incremented by a positive quantity, $\eta$. At the same
time, a proportional decay occurs in all active R-A connections, so that in
the absence of reinforcements, they tend to approach zero exponentially. The
net change in value of connection $v_{ri}$ at time $t$ is therefore*

$$\Delta v_{ri}(t) = r^*(t-\tau)\left[\eta\, a_i^*(t) - \frac{\eta}{N_a}\sum_j a_j^*(t) - \delta\, v_{ri}(t)\right] \tag{21.1}$$

Assuming, as before, that each stimulus persists for a time $T \gg \tau$, the
result of this rule is to raise the value of the feedback signal to all A-units
which respond to the current stimulus, from the active R-units, and at the
same time to develop inhibitory connections to the A-units which are not
currently active. The decay guarantees that the entire system will tend
towards a dynamic equilibrium, at which the expected rate of gain just
balances the rate of decay.

* Note that in this equation decay occurs only when $r^* = 1$. This
means that the feedback signals from different R-units will have
approximately equal weight, regardless of the relative frequency
with which the R-units are used. The transmission delay, $\tau$,
is included only for conformity to previous models, and plays no
essential role here.
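The behavior of a rule of this kind can be tried numerically. The following sketch assumes one particular reading of the rule: a gain for A-units active after the R-unit fired, a compensating loss spread over all of that R-unit's connections so that the gains sum to zero, and a proportional decay, all gated by the earlier R-unit output. The parameter names `eta` and `delta` and the function itself are illustrative; the exact form of Equation 21.1 may differ.

```python
def update_feedback(v, r_prev, a_now, eta=0.1, delta=0.01):
    """One step of an assumed gamma-type rule for one R-unit's back-connections.

    v      : current connection values, one per A-unit
    r_prev : output (1 or 0) of the R-unit one transmission delay earlier
    a_now  : current A-unit outputs (1 or 0)
    """
    if not r_prev:
        return list(v)  # neither gain nor decay unless the R-unit was on
    mean_activity = sum(a_now) / len(v)
    return [
        vi + eta * (ai - mean_activity) - delta * vi
        for vi, ai in zip(v, a_now)
    ]

# Hold one stimulus (A-units 0 and 1 active) while the R-unit stays on.
values = [0.0, 0.0, 0.0, 0.0]
activity = [1, 1, 0, 0]
for _ in range(2000):
    values = update_feedback(values, 1, activity)
print(values)
```

Under these assumptions the connections to the active A-units become excitatory, those to the inactive A-units become inhibitory, and both settle where gain and decay balance, at plus or minus `eta * 0.5 / delta` in this example.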


The effect of this system is illustrated in Figure 59, which shows
the condition after associating stimulus $S_1$ to the response (1,0) and $S_2$
to the response (0,1), by an error correction procedure. This corresponds
to the same conditions as Figure 56. The sets which respond when
$R^* = (0,0)$ are shown by the large circles. If these sets are initially
reinforced to yield the appropriate response for each stimulus, then when the
composite stimulus appears, they will try to turn on opposite responses, with
about equal strength. Such a condition, however, will be an unstable one. If
one of the sets, say $S_1$, carries slightly greater weight than the other,
the condition illustrated in the figure will arise. With $r_1$ on, excitatory
signals will be transmitted back to the $S_1$ set, and inhibitory signals to
all other A-units, including the $S_2$ set. Thus the $S_1$ set remains
unchanged, but the $S_2$ set is diminished. Alternatively, if $S_2$ should
gain an advantage, the $S_2$ set will tend to remain unchanged, and the $S_1$
set will be reduced.

Figure 59 A-SETS RESPONDING TO THE COMPOSITE STIMULUS $S_1 S_2$. SHADING
SHOWS ACTIVE A-SETS FOR THE RESPONSE STATE (1,0).
(COMPARE Figure 56).


If we assume that the universe consists of a large number of
stimuli in each class, as in Experiments 15 and 16, the set of A-units
responding to $S_1$ would generally not be perfectly preserved, but would
be shifted to include more units which respond to many stimuli in the $S_1$
class, and to eliminate those units which respond only to $S_1$. Thus
there is an additional tendency, in this system, to convert the sets of
A-units for different stimuli which have been associated to the same response,
to sets which are nearly identical. It is clear that if the procedures of
Experiments 15 and 16 are carried out with this system (but with the usual
error-correction practice of reinforcing in the presence of the wrong
responses only, rather than forcing the correct response) the results predicted
in Section 21.1 will be obtained, but with less chance of confusion or
erroneous bias due to conflicting active sets. The special property of the
variable feedback system can be characterized as a tendency to activate the
A-units responding to one of the previously trained parts of a complex
stimulus, while suppressing those A-units which respond to the remaining
parts.
21.2.2 Servo-Controlled Threshold Systems


In all perceptrons considered thus far, the thresholds of the
A-units have been assumed to be invariant over time. It is possible to vary
the effective threshold of an A-unit by adding an excitatory or inhibitory
component to its input signal. If this is done for all A-units in the system,
the result will be to increase or decrease the proportion of units which
respond to a given stimulus. If all signals and thresholds are quantized, then
the change in the active set will occur by sudden jumps; for example, the
addition of $\Delta\alpha = +1$ will suddenly activate all A-units whose $\alpha$-signal
was equal to $\theta_i - 1$. Such a condition would be hard to utilize effectively
for the control of activity. On the other hand, if each A-unit has a threshold
$\theta_i$ selected at random from some continuous distribution, say a Gaussian
distribution, then there will always be some A-units whose thresholds $\theta_i$
are just below the present value of $\alpha_i$, and others whose thresholds are
just above the present value of $\alpha_i$. In this case, a slight change in $\theta$
will always yield a corresponding change in the size of the active A-set, and
the size of the active set will vary in an approximately continuous fashion
as $\theta$ is changed continuously.
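The contrast between quantized and continuously distributed thresholds is easy to see numerically. The sketch below is an illustration under assumed values (integer signals from 0 to 6, integer thresholds from 1 to 5, and Gaussian thresholds with mean 3 and unit standard deviation); none of these numbers come from the text.

```python
import random

def active_count(signals, thresholds, shift):
    """Number of A-units whose signal reaches threshold + shift."""
    return sum(1 for a, th in zip(signals, thresholds) if a >= th + shift)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_units = 1000
signals = [rng.randint(0, 6) for _ in range(n_units)]
quantized = [rng.randint(1, 5) for _ in range(n_units)]
continuous = [rng.gauss(3.0, 1.0) for _ in range(n_units)]

# Sweep a common threshold shift from -1.0 to +1.0 in steps of 0.1.
shifts = [k / 10 for k in range(-10, 11)]
quantized_counts = [active_count(signals, quantized, s) for s in shifts]
continuous_counts = [active_count(signals, continuous, s) for s in shifts]

print(quantized_counts)
print(continuous_counts)
```

With quantized values the count is constant between integer shifts and jumps at them, whereas with Gaussian thresholds it changes a little at nearly every step, which is the property a servo needs for fine control.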


Figure 60 shows a back-coupled perceptron in which the amount
of activity is continuously monitored by a servomechanism, which controls
the magnitude of the thresholds so as to keep the total activity constant.
If the fraction of active units falls below the desired level, the servo-system
transmits an excitatory signal to all A-units (equivalent to $\Delta\theta < 0$) while
if the activity rises above the desired level, an inhibitory signal (equivalent
to $\Delta\theta > 0$) is transmitted to all A-units.




Figure 60 BACK-COUPLED PERCEPTRON WITH SERVO-CONTROLLED THRESHOLDS. 
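The action of the servo can be sketched as choosing a common threshold shift that leaves a fixed number of A-units active. This is an idealized, one-shot version of what would in practice be a continuous control loop, and it assumes that all margins (signal minus threshold) are distinct; the function names are hypothetical.

```python
def servo_shift(signals, thresholds, target_active):
    """Threshold shift that leaves exactly `target_active` units on.

    A positive result corresponds to an inhibitory signal to all A-units,
    a negative result to an excitatory one.
    """
    margins = sorted(
        (a - th for a, th in zip(signals, thresholds)), reverse=True
    )
    if not 0 < target_active <= len(margins):
        raise ValueError("target_active out of range")
    return margins[target_active - 1]

def active_units(signals, thresholds, shift):
    """Indices of units whose margin is at least the common shift."""
    return [
        i for i, (a, th) in enumerate(zip(signals, thresholds))
        if a - th >= shift
    ]

thresholds = [1.5, 2.25, 2.5, 1.0, 0.75]
strong = [3.0, 5.0, 2.0, 4.0, 1.0]    # intense stimulus
weak = [0.75, 1.25, 0.5, 1.0, 0.25]   # same pattern at a quarter of the intensity

for signals in (strong, weak):
    shift = servo_shift(signals, thresholds, target_active=2)
    print(shift, active_units(signals, thresholds, shift))
```

In this example the strong stimulus calls for a positive shift and the weak one for a negative shift, while the number of active units is the same in both cases.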


Such a system is likely to have advantages in many types of
perceptrons. Attached to a series-coupled perceptron, for example, the
$\theta$-servo can guarantee that regardless of stimulus size or intensity, the
level of A-unit activity will be optimum. In a cross-coupled system, it can
be used to prevent "blow-ups" of activity, by providing an active mechanism
for counterbalancing the growth of excitatory weights. It is worth noting
that the $\theta$-servo can substitute for inhibitory connections from the retina
to A-units, since it generally yields the condition that if stimulus $S_x$ is
a subset of stimulus $S_y$ (on the retina), the corresponding active
association set $A(S_x)$ will not be a subset of $A(S_y)$. In the back-coupled
system, the $\theta$-servo yields particularly interesting results.


Figure 61(a) shows the condition of the A-set for the same stimuli
as in Figure 59, with the R-units in the (0,0) state, so that there is no
feedback. The large circles show the sets which respond to $S_1$ and $S_2$ alone,
normalized by the action of the servomechanism. When the composite
stimulus appears, it is no longer possible for the union of the sets $A(S_1)$
and $A(S_2)$ to remain active, however; consequently the active sets
are reduced to those units (shown by the shaded areas of the diagram) for
which $\beta_i \geq \theta_i + \Delta\theta$. Under these conditions there is still no bias
favoring the $S_1$ response or the $S_2$ response; both sets are still in
balance, and either response might occur. As before, however, this
condition tends to be unstable, and (assuming that $S_1$ and $S_2$ have been
associated to the same response codes as previously) either (1,0) or
(0,1) will tend to occur.


Figure 61(b) shows the stable state of the system in which the
response (1,0) has become dominant. The servo-system is now obliged to
adjust to the effect of the excitatory signal fed back to the $A(S_1)$ set, and
the inhibitory signal to the $A(S_2)$ set. The result is that the active set is
nearly identical to the set which would be active for $S_1$ alone, the $A(S_2)$
set being virtually obliterated by the combined effect of the negative
feedback and the increased threshold. It seems likely that by
strengthening the excitatory feedback component ($a_1$ in the diagram) sufficiently,
the active set can be made to coincide perfectly with the set responding to
$S_1$ alone. Thus the effect of selecting the (1,0) response configuration is to
enable the perceptron to respond exclusively to the $S_1$ stimulus, completely
free from interference by the presence of $S_2$. Reversal of the $R^*$ state
would, of course, lead to a reversal of the A-state. These phenomena are
highly suggestive of reversible perspective and figure-ground reversal in
psychological experiments, where one of two ways of perceiving a complex
figure dominates to the exclusion of the other.

[Figure 61: two panels, not reproduced. (a) ACTIVITY STATE FOR $R^*$ = (0,0);
(b) ACTIVITY STATE FOR $R^*$ = (1,0).]

Figure 61 ACTIVE A-SETS FOR COMPOSITE $S_1 S_2$ STIMULUS, IN SERVO-CONTROLLED
BACK-COUPLED SYSTEM. ACTIVE SETS SHOWN BY SHADED AREAS.
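The combined effect of feedback and servo control can be illustrated with a toy set of eight A-units. In this sketch the servo is modeled as keeping the units with the largest margins active (equivalent to a common threshold shift when the margins are distinct), and the feedback as a fixed gain added to one stimulus's units and subtracted from the rest; the numbers and names are assumptions chosen for illustration.

```python
def servo_active_set(margins, target_active):
    """Indices of the `target_active` units with the largest margins."""
    ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
    return sorted(ranked[:target_active])

def with_feedback(margins, favored, gain):
    """Add `gain` to units in `favored`, subtract it from all other units."""
    return [
        m + gain if i in favored else m - gain
        for i, m in enumerate(margins)
    ]

# Margins (retinal signal minus threshold) under the composite stimulus.
# Units 0-3 respond to the first stimulus, units 4-7 to the second.
margins = [2.0, 1.5, 0.4, 0.2, 1.9, 1.4, 0.5, 0.3]
first_set = {0, 1, 2, 3}

print(servo_active_set(margins, 4))                                 # no feedback
print(servo_active_set(with_feedback(margins, first_set, 1.0), 4))  # first response on
```

Without feedback the servo retains the strongest units of both stimuli; once the feedback favoring the first stimulus is applied, the retained set coincides with the units that respond to that stimulus alone.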


In a dual-modality perceptron, the above system will work in a
similar fashion, assuming that separate $\theta$-servos are employed for the
visual and auditory channels. Thus by giving the audio symbol for square
or triangle, top or bottom, in Experiment 16, the perceptron can be directed
to attend to one of the two objects present, and will develop an A-unit state
which corresponds closely to the state which would be expected if only the
indicated object were present in the field.


21.3 Linguistic Concept Association in a Four-Layer Perceptron


In Section 21.1.2, it was noted that although names can be
associated to objects or visual events in a three-layer back-coupled model,
so as to permit the experimenter to direct the attention of the perceptron
selectively to a named object in a compound field of stimuli, the associations
formed tend to be associations of particular stimuli, rather than universals.
It is not possible to change the name of an object (or a class of objects)
without actually undoing the previous perceptual organization of the stimulus
world for the given perceptron, and then reconstructing it in a new form.
Words and visual patterns are not distinguished, at the response level, but
are amalgamated into a common concept.
A perceptron which is capable of first forming auditory and
visual concepts, or universals, and then associating these with one another,
and which can change its "linguistic associations" without disrupting its
perceptual organization, is illustrated in Figure 62. The system has a
visual input and an audio-input, as in Figure 58. It is also equipped with
a $\theta$-servo, and the back connections to the $A_v$ set are variable, as in
Section 21.2. For present purposes no back-connections to the $A_a$ set are
required. There are two distinct sets of R-units: one set, $R^{(v)}$, receives
its primary inputs from the $A_v$ system, and can be associated to visual
stimuli. The second set, $R^{(a)}$, receives its primary inputs from the
audio-system, and can be trained to respond to sound patterns, or words. (By using
a spectrum of $\tau_{ij}$ for the $S_a$ to $A_a$ connections, or by means of a
cross-coupled $A_a$-set, the system can be taught to recognize sound
sequences, so that it need not be restricted to momentary sound patterns.)

Figure 62 A DUAL-MODALITY PERCEPTRON FOR LINGUISTIC CONCEPT ASSOCIATION.
Thus far, we have what amounts to two mutually independent
perceptrons, one for visual stimuli, and the other for auditory stimuli.
Each of these perceptrons can form classes and generalizations by means of
an error-correction procedure applied to the appropriate response sets.
The added feature, however, is the extra association layer, which, in this
system, comes after the R-units. The A-units in this set receive fixed
connections from the R-units (which form a sort of retina for a second-order
perceptron) and send back variable-valued $\alpha$-system connections to the
R-units. It is assumed that each R-unit (in both sets) receives connections
from all of the $A^{(2)}$ units, and that the values of these connections can be
corrected by an error-correction procedure, just as with the connections
from the $A^{(1)}$ layer.


Suppose the perceptron has already been trained to recognize
several kinds of visual objects (say squares and triangles) and has also been
trained to recognize several spoken words ("square" and "triangle") for a
variety of intonations, voice qualities, etc. During this training, the $A^{(2)}$
to R-unit back-connections have not been reinforced. Now let the perceptron
hear the word "triangle", without any visual stimulus being present. The
result will be an appropriate code-configuration in the $R^{(a)}$ units, which
will induce a characteristic state of the $A^{(2)}$ system, identifying the
spoken word. By means of an error correction procedure, the perceptron
can now be biased to give the $R^{(v)}$ code for a triangle, and will hereafter
tend to prefer this response to any others when the word "triangle" occurs.
Consequently, when a composite stimulus is presented, as in Experiment 16,
together with the spoken word "triangle", the system will tend to give the
$R^{(v)}$ response to the triangle, and due to the feedback connections to the $A_v$
set, and the action of the $\theta$-servo, it will selectively augment the inputs to
those A-units which respond to the triangle, while tending to suppress
activity of A-units responding to other stimuli. Since all idiosyncratic forms
of the spoken word, and all forms of the triangle-pattern, have been
associated to identical response codes, the association will generalize immediately
over both the audio class and the visual class of stimuli, without having to
train the system with multiple examples of each.


Thus the four-layer perceptron can be made to direct its
attention in response to spoken commands in much the same way as the
previous models, but without requiring a modification of the A-R connections,
or "perceptual organization" of the network, in forming the linguistic
association. By a similar procedure, the $A^{(2)}$ to $R^{(a)}$ connections can be
reinforced in the presence of a visual pattern to create a bias, or "expectancy",
favoring the perception of the word corresponding to the perceived object. By
replacing the $\alpha$-system back-connections from $A^{(2)}$ to the R-units with
$\gamma$-system connections (as in Equation 21.1) the association can be made
to occur in a relatively spontaneous fashion, by presenting the visual image
together with its spoken name. The result will be a reinforcement of the
connections from the $A^{(2)}$ set which responds jointly to the visual and
auditory codes; since this set will have many units in common with the
separate audio and visual $A^{(2)}$ sets, the reinforcement will tend to
generalize, to yield the desired result.


This system can be used for the problem of searching for a
named object which is not currently present in the visual field. For this
task, one must assume that the R-units are of a "flip-flop" variety,
which tend to go on and stay on when they receive a sufficient input signal,
until they are specifically cut off by a strong inhibitory signal. The system
is taught to initiate an automatically controlled search or scan procedure in
response to the spoken word "search". It is also trained (at the A(2) level)
to turn off the search response whenever a coincidence occurs between a
spoken name-code and the visual object-code, but to leave the search-state
alone when either the name or the object, but not both, is present. Thus, given
the command "Search for square", the word "search" initiates the search
activity, and the word "square" sets the system to anticipate a square pattern.
When a square appears in the field, the A(2) set corresponding to the
combined object-code and word-code is activated, and transmits a strong
inhibitory signal to the search response, turning it off. It would be possible
to go a step farther, by training the perceptron (which has now isolated the
set of A-units responding to the square) to continuously center the image of
the square in the retina, using two continuous R-units to measure x and y
displacements of the image from the center of the field (as in Section 10.2).
Such a system, having found a moving stimulus, will track it and tend to
keep it centered without being confused by the presence of extraneous objects
in the field.
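

The search procedure just described can be sketched in a few lines of Python.
This is only an illustrative sketch under stated assumptions: the function
name run_search, the representation of the visual field as one set of object
names per time step, and the reduction of the coincidence units to a simple
membership test are all hypothetical and are not part of the model itself.

```python
def run_search(name, frames):
    """Return the index of the frame at which the search response is cut off.

    name   -- the spoken name-code held by the system (e.g. "square")
    frames -- sequence of sets, each listing the objects visible at one step
    Returns None if the search is never turned off.
    """
    searching = True  # flip-flop R-unit: latched on by the word "search"
    for step, visible in enumerate(frames):
        # Hypothetical stand-in for the coincidence unit: it is active only
        # when the name-code and the matching object-code are both present.
        coincidence = name in visible
        if searching and coincidence:
            searching = False  # strong inhibitory signal resets the flip-flop
            return step
    return None


if __name__ == "__main__":
    frames = [{"circle"}, set(), {"triangle", "square"}]
    print(run_search("square", frames))
```

The latch stays on through frames that contain only other objects, which is
the property the flip-flop units are assumed to supply.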




22. PROGRAM-LEARNING PERCEPTRONS


In the last chapter, we have seen that a back-coupled perceptron 
can be made to attend selectively to parts of a complex field, suppressing 
A-unit activity corresponding to objects other than the one attended to. In 
the last few paragraphs, it was also shown that such a perceptron can be 
made to anticipate decisions which are to be made at a future time, and 
execute them when the appropriate perceptual conditions are met. This 
lays the basis for the learning of sequential programs of responses in
perceptrons.


Programmed activity is, of course, of supreme importance in
carrying out logical sequences or algorithms, as in a digital computer. It
also appears to provide a possible basis for the recognition of highly complex
stimulus configurations, which depend on relations of simpler parts, rather
than a fixed overall shape. The recognition of a human form, or an animal,
is of this variety. It is also possible that the recognition of abstract
topological relations -- a problem which has hitherto defied all perceptrons
analyzed -- can be performed by means of a suitable programmed sequence
of observations. This writer has become increasingly convinced that a
passive filter-type system (such as a simple perceptron) cannot be designed
which will economically recognize topological abstractions and relations
such as "A and B are disjoint" or "A is inside B" or "A is a closed curve".
On the other hand, a perceptron which can attend selectively to part of the
stimulus pattern at a time, and carry out a sequence of observations under
program-control, seems to offer a potential solution to this problem.


22.1 Learning Fixed Response Sequences


A perceptron of the back-coupled or cross-coupled variety can
be taught to execute a fixed, stereotyped sequence of responses without
introducing any new features in the system. If the sequence R1, R2, R3
is required, for example, when stimulus S1 occurs, but the inverse
sequence (R3, R2, R1) when S2 occurs, it is only necessary to
associate the required responses to the succession of A-states which
follow the stimulus in the cross-coupled system, or to the A-states which
result from the interaction of the retinal input and the R-A feedback, in the
back-coupled system. Of these two approaches, the cross-coupled system is
more versatile, since it can be triggered by a momentary stimulus, and will
not return to an identical state if the same response condition should occur
at different points in the sequence. The cross-coupled system, however,
requires that the response sequence occur with exact timing of each element.
If the triggering or execution of each response takes an indeterminate amount
of time, then a closed-loop system of the type shown in Figure 63 would be
more appropriate. This system (which is also applicable to the recognition
of strings of sensory events, such as words or speech sounds, where each
element of the sequence is of indeterminate duration) employs an A(2) system
with units which tend to lock on once they are activated, unless specifically
triggered. These units are of the same variety as the "flip-flop R-units"
employed in the R-set in Section 21.3. The A(2) set is cross-coupled,
with fixed connections, and feeds back (with fixed connections) to the A(1)
set.


[Figure 63 diagram; the A(2) set is labeled CROSS-COUPLED.]

Figure 63 FOUR-LAYER PERCEPTRON FOR RECOGNITION AND CONTROL OF R-SEQUENCES
WITH ELEMENTS OF INDETERMINATE DURATION.


When a response occurs in the R-set, it immediately triggers
the A(2) system to assume some characteristic state. The parameters
of the cross-coupling at the A(2) level can be so picked (e.g., by making
all interconnections inhibitory) that the system will immediately assume a
steady state, which will be held until some subsequent response occurs.
When the second response of the sequence occurs, it finds the effective
thresholds of the A(2) units modified by the cross-coupling signals from
the units which are already on. Consequently, the A(2) state which occurs
will depend not only on the new response, but also on the previous A(2)
state. Unlike the previous cross-coupled systems, however, it does not
depend on the time-lapse since the previous input, since the A(2) state
has held steady over the interval.


By means of the feedback to the A(1) set, the A(2) state
(and consequently the response sequence) can be made to modify the response
of the A(1) system to the present stimulus. Thus a distinct succession
of responses can be associated to the stimulus, each new A(1) state
signifying the joint information that the stimulus is present, and that a
particular succession of responses has occurred in the past. To terminate
such a sequence, it is possible to assume that one of the R-units has
inhibitory connections to all A(2) units, so that when the end of the sequence
is recognized, the A(2) system can be reset to its inactive state, by
turning on this response.
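

A minimal sketch of this closed-loop scheme follows, under assumptions that
should be stated plainly. The latched state of the cross-coupled set is
represented simply as the tuple of responses emitted so far, the learned
associations are written out as a lookup table, and the names PROGRAM,
run_sequence, R1 to R3, S1 and S2 are hypothetical labels chosen for
illustration. No timing enters the sketch, which mirrors the point that the
held state does not depend on the time lapse between responses.

```python
# Hypothetical program table: (stimulus, latched state) -> next response.
# The latched state stands in for the held state of the cross-coupled set;
# here it is simply the tuple of responses emitted so far.
PROGRAM = {
    ("S1", ()): "R1",
    ("S1", ("R1",)): "R2",
    ("S1", ("R1", "R2")): "R3",
    ("S2", ()): "R3",
    ("S2", ("R3",)): "R2",
    ("S2", ("R3", "R2")): "R1",
}


def run_sequence(stimulus):
    """Emit the response sequence associated with a sustained stimulus."""
    state = ()      # inactive state of the latching units
    emitted = []
    while True:
        # The next response depends jointly on the stimulus and on the
        # latched record of what has already been done.
        response = PROGRAM.get((stimulus, state))
        if response is None:
            # End of sequence recognized: the terminating response would
            # inhibit all latching units and reset them to the inactive state.
            return emitted
        emitted.append(response)
        state = state + (response,)


if __name__ == "__main__":
    print(run_sequence("S1"))
    print(run_sequence("S2"))
```

Because the same response (R2 here) is reached from different latched states
under the two stimuli, the table distinguishes its two occurrences, which a
system keyed on the current response alone could not do.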


22.2 Conditional Response Sequences 


In the last section, the response sequences learned by the
perceptron were assumed to be of a fixed, stereotyped variety, such as
the utterance of a given word or phrase, or the execution of a particular
sequence of movements. Of more general interest is the possibility of
conditional response sequences, where the execution of the next step
depends upon the realization of a set of conditions at the present time.


In a limited sense, we have already demonstrated the possibility
of conditional responses in the perceptron of Figure 63, where the
next response depends not only upon the preceding R-sequence, but also
upon the continuation of the initiating stimulus. A more interesting case,
however, would be one in which the next response depends upon the
recognition of some condition which results from the preceding activity of the
perceptron itself. For example, if the perceptron is equipped with a moveable
appendage by means of which it can apply pressure to external objects,
we might ask it to push aside any object placed in front of it. Such objects
might have their movement blocked, either to the right or to the left, in
which case the perceptron might first bring its "pushing arm" into contact
with the left side of an object and try pushing to the right, but if it finds
that the object remains stationary, it must reverse the position of its arm,
and push to the left.


Such a decision program still seems to be within the capability
of a perceptron of the type just described. It must recognize (through its
visual inputs) the conditions "no object present", "object present to right of
arm location", "object present to left of arm location", "arm in contact with
left side but object stationary", "arm in contact with left side and object
moving", etc. The recognition of the contact conditions might be facilitated
by the inclusion of pressure transducers on the arm, providing an auxiliary
sensory input to the association system. An appropriate response sequence
must then be associated to each of these conditions. For example, if the
condition "arm in contact with left side but object stationary" is recognized,
the response sequence might be


1. Retract arm
2. Shift arm position to right
3. Extend arm


This would then yield the condition "object present to left of arm location",
for which the response would be


1. Shift arm to left until it contacts object
2. Apply pressure


The conditions of "moving" and "stationary" objects can, of course, be
recognized by a perceptron with time delays from the retina to the A-units,
so that there is nothing in the above description which cannot be done,
in principle, by perceptrons which have already been analyzed.
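

The decision program above amounts to a table from recognized conditions to
response sequences, executed in a loop against the environment. The sketch
below illustrates that reading with a toy world. Everything in it is an
assumption made for illustration: the names PROGRAM, recognize, apply and
push_aside, the dictionary used for the world state, and the response listed
for "object present to right of arm location", which the text does not spell
out and which is filled in here by symmetry.

```python
PROGRAM = {
    # Assumed by symmetry with the left-hand case; not given in the text.
    "object present to right of arm location":
        ["shift arm to right until it contacts object", "apply pressure"],
    "arm in contact with left side but object stationary":
        ["retract arm", "shift arm position to right", "extend arm"],
    "object present to left of arm location":
        ["shift arm to left until it contacts object", "apply pressure"],
}


def recognize(world):
    """Toy stand-in for the perceptual recognition of the current condition."""
    if world["contact"] and world["pressing"]:
        push_dir = "right" if world["arm_side"] == "left" else "left"
        outcome = ("and object moving" if push_dir != world["blocked"]
                   else "but object stationary")
        return f"arm in contact with {world['arm_side']} side {outcome}"
    object_side = "right" if world["arm_side"] == "left" else "left"
    return f"object present to {object_side} of arm location"


def apply(world, action):
    """Update the toy world after one elementary response."""
    if "until it contacts object" in action:
        world["contact"] = True
    elif action == "apply pressure":
        world["pressing"] = True
    elif action == "retract arm":
        world["contact"] = False
        world["pressing"] = False
    elif action == "shift arm position to right":
        world["arm_side"] = "right"   # arm now lies to the right of the object
    # "extend arm" changes nothing that this toy world keeps track of.


def push_aside(blocked):
    """Run the program; `blocked` is the one direction the object cannot move."""
    world = {"arm_side": "left", "contact": False,
             "pressing": False, "blocked": blocked}
    trace = []
    for _ in range(5):                      # small bound; the program is short
        condition = recognize(world)
        trace.append(condition)
        if condition.endswith("object moving"):
            break
        for action in PROGRAM[condition]:
            apply(world, action)
    return trace


if __name__ == "__main__":
    for line in push_aside("right"):
        print(line)
```

The table has no entry for an object blocked on both sides; that case would
need a further condition and response and is left out of the sketch.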


22.3 Programs Requiring Data Storage 


In all of the sequential programs considered above, the next
step has been determined entirely by the conditions at the previous step,
and a knowledge of how many steps have already occurred in the current
sequence. More elaborate programs require a conditional response based on
information which was available several steps previously, but is no longer
present in the sensory input. The perceptrons considered so far can solve
such problems only by anticipating all possible sequences of conditions,
and learning a unique response sequence for each special case. This rapidly
becomes impractical, as the sequences become more involved. An example
of such a problem is counting. In counting from zero upwards, we first
produce a sequence of single digits, from one through nine; we then add a
second digit (a one) and reset the low order digit to zero. The one in second
place is held fixed, while the low order digits are recycled, and is then
changed to two, and so forth. At an advanced stage in this procedure, we
may be holding three or four high-order digits "in memory" while modifying
the low-order digits. To perform such a program expeditiously, an internal
storage mechanism is required, which can be set to hold a given item of
information and read out or altered whenever required. Such a memory
mechanism is much more like a conventional digital computer memory than
anything yet encountered in perceptron theory.
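

The role of storage in the counting example can be made concrete with a short
sketch. It assumes a conventional addressable store (a Python list of decimal
digits, low-order digit first) and a hypothetical function name, increment;
it illustrates the logical requirement only and is not a proposal for how a
perceptron would realize it.

```python
def increment(digits):
    """Advance a decimal counter by one.

    digits -- list of ints 0..9, low-order digit first (digits[0] is the units)
    Returns a new list; the input is left unchanged.
    """
    out = list(digits)
    position = 0
    while True:
        if position == len(out):
            out.append(1)          # add a new high-order digit (a one)
            break
        if out[position] < 9:
            out[position] += 1     # ordinary step: only this digit changes
            break
        out[position] = 0          # reset this digit and carry to the next
        position += 1
    return out


if __name__ == "__main__":
    counter = [0]
    for _ in range(123):
        counter = increment(counter)
    # Print high-order digit first, as the number is normally written.
    print("".join(str(d) for d in reversed(counter)))
```

With stored digits, each step reads and alters at most a few positions. A
system without such storage would need a distinct learned state for every
count it might reach, which is the impracticality noted above.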


While it is fairly easy to contrive systems which employ rigidly
determined gating mechanisms and more-or-less conventional computer memory
logic to provide a temporary storage device for a perceptron, no really
satisfying solution has been found to date. A biological system undoubtedly
employs something more subtle than a coded address system which transmits its
stored information on command, but the similarity in logical requirements
nonetheless suggests that there might be a similarity in structure at this
particular point. It should be remembered, however, that human ability to
perform complex algorithms without extensive practice and learning time
does not begin to approach that of a digital computer. The human computer
also tends to rely heavily on such external aids as pencil and paper to augment
his memory for relevant data, and with the aid of an external transcription of
its outputs, a perceptron can also be made to perform rather elaborate logic
(in the manner of Section 22.2).


Some possible cues as to the nature of temporary data storage in
the human brain come from introspective observations of recall of strings of
digits, words, or melodies, and such exercises as attempting to count in
binary up to the point where one loses track of the number on which one is
operating. In all of these cases, recall is helped by rhythmic grouping of
elements, and by visualization or auditory imagery of the elements in a
continuously recurrent sequence. It seems likely that an active memory,
such as a reverberating loop system, which continuously rewrites itself
on every rehearsal of the stored information, is involved.


22.4 Attention-Scanning and Perception of Complex Objects


The preceding sections have dealt with the phenomena of
program learning with respect to response sequences. A capability for
program learning is also useful for the direction of attention over a sensory
field, and the perception of a complex pattern or object by noting its parts
and the relations between them. The possibility of directing attention
selectively to part of the visual field was already observed in the last
chapter. A program-controlled perceptron could, therefore, be taught to
direct its attention successively to different parts of the field in some
systematic order, e.g., to scan from left to right, or top to bottom. It is also
plausible (although it remains to be demonstrated) that a back-coupled
perceptron can be taught to shift its field of attention along a contour, or
edge of a figure, so that the association set, at any one time, responds
only to part of the contour. Such a system, by starting at one point on a curve
and following it in one direction, could determine whether the curve is closed
or open by indicating whether the scan process returns to its starting point
without having lost the contour at any time.
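

The closed-or-open test by contour following can be illustrated with a small
sketch. It assumes an idealized retina in which the curve is a set of grid
cells exactly one cell thick, connected through horizontal and vertical
neighbors, with no branches. The function names neighbors and is_closed are
hypothetical, and the sketch says nothing about how a perceptron would learn
to carry out the scan.

```python
def neighbors(point, curve):
    """Cells of the curve adjacent (horizontally or vertically) to point."""
    x, y = point
    candidates = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
    return [q for q in candidates if q in curve]


def is_closed(curve, start):
    """Follow the curve from start in one direction; report whether the scan
    returns to its starting point without losing the contour."""
    previous, current = None, start
    for _ in range(len(curve)):
        # Never step back onto the cell just left: keep one direction.
        onward = [q for q in neighbors(current, curve) if q != previous]
        if not onward:
            return False            # contour lost: the curve is open
        previous, current = current, onward[0]
        if current == start:
            return True             # scan has come back to where it began
    return False


if __name__ == "__main__":
    ring = {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)}
    arc = {(0, 0), (1, 0), (2, 0)}
    print(is_closed(ring, (0, 0)), is_closed(arc, (0, 0)))
```

Only the current cell, the previous cell and the starting point are held at
any time, which matches the idea that the association set responds to just
part of the contour at each step.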


In the recognition of a complex structured object, such as a
man (regardless of posture, angle of view, etc.) a program of observations
might note significant parts and the transitions between them. There should,
for example, be a head joined to the shoulders, and by following a path from
one of the hands, the system should successively come to a forearm, shoulder,
and torso. The reader may recognize a similarity between this suggestion
and Hebb's concept of a "phase sequence" (Ref. 33). The phase sequence
consists of a progression of cell-assemblies, each of which represents some
elementary perceptual fragment, the entire sequence representing a
perception of a complex stimulus or experience. In the perceptron, however,
the progression of states is assumed to be under the control of a learned
program, which directs the attention of the system in such a way as to make
first one set of A-units, then another set achieve dominance, by the
mechanisms described in Chapter 21. A sequence-recognizing system, such
as the five-layer perceptron shown in Figure 64, would be required for the
direction of the scanning process and for the recognition of the total
configuration from its parts. This system employs an A(2) layer of the same
type as in Figure 63 (cross-coupled, with fixed interconnections, and A-units
which hold their state until triggered by a sufficiently strong signal to
change). The A(2) set in this model, however, has variable-valued connections
both to a new R(2) set, which can learn to recognize complex patterns from
sequences of parts, and also back to the R(1) units, so that the system can
be taught to direct its attention in a systematic manner to look for anticipated
parts of the complex.


[Figure 64 diagram; legible labels include the θ-SERVO.]

Figure 64 FIVE-LAYER PERCEPTRON FOR RECOGNITION OF COMPLEX PATTERNS BY
ATTENTION SCANNING PROGRAMS. (BROKEN ARROWS INDICATE
VARIABLE CONNECTIONS).


22.5 Recognition of Abstract Relations


It is apparent that the perceptrons proposed above are already
stretching the limits of what has been firmly established analytically and
experimentally. While there is good reason to think that the proposed
systems would work in principle, they are highly speculative, and we are
far from being able to describe their performance in quantitative terms.
Nonetheless, one further venture in extrapolation seems to be of interest.
As was previously noted, the recognition of abstract topological relations
(or metric relations, for that matter) cannot be performed economically
by a perceptron which is required to grasp the relation instantaneously from
a complex pattern. The relation "A is inside of B", for example, would
require that the system be trained with all possible cases of "A inside B"
and "A outside B", even after it has been taught to identify patterns "A"
and "B" correctly. It seems more likely that a program-controlled perceptron,
having been taught to recognize patterns A and B, can determine whether A
is inside of B by means of a directed scanning process.


Suppose we show the perceptron a complex field, containing a
circle and a square, both of which it has previously been taught to identify,
and we ask the system to indicate whether the circle is inside or outside
the square. This question could be answered by means of two attention
sweeps, beginning at the circle and first sweeping to the right, then returning
to the circle and sweeping to the left. If an edge of the square is encountered
on one of the two sweeps but not on the other, then the circle is "outside"
the square; if an edge is encountered both to the right of the circle and to the
left, the circle is "inside" the square. A somewhat more elaborate
program would determine whether a known figure (e.g., a square or
triangle) is inside or outside of an arbitrary closed curve.
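

The two-sweep test can be written out as a short sketch. It makes several
simplifying assumptions that are not in the text: the field is a grid of
cells, the circle is reduced to the single cell where attention starts, the
square is given by the set of cells on its edges, and the case in which
neither sweep meets an edge is also counted as "outside". The names
square_edge, sweep_hits_edge and circle_inside_square are hypothetical.

```python
def square_edge(x0, y0, x1, y1):
    """Cells lying on the boundary of the axis-aligned square (x0,y0)-(x1,y1)."""
    cells = set()
    for x in range(x0, x1 + 1):
        cells.add((x, y0))
        cells.add((x, y1))
    for y in range(y0, y1 + 1):
        cells.add((x0, y))
        cells.add((x1, y))
    return cells


def sweep_hits_edge(start, step, edge_cells, width):
    """Sweep attention horizontally from start (step = +1 right, -1 left)
    and report whether an edge cell is met before leaving the field."""
    x, y = start
    x += step
    while 0 <= x < width:
        if (x, y) in edge_cells:
            return True
        x += step
    return False


def circle_inside_square(circle_pos, edge_cells, width):
    """Inside only if an edge is met on both the rightward and leftward sweep."""
    right = sweep_hits_edge(circle_pos, +1, edge_cells, width)
    left = sweep_hits_edge(circle_pos, -1, edge_cells, width)
    return right and left


if __name__ == "__main__":
    edge = square_edge(2, 2, 8, 8)
    print(circle_inside_square((5, 5), edge, 12))
    print(circle_inside_square((0, 5), edge, 12))
```

The rule is adequate for a convex figure such as the square; as the text
notes, an arbitrary closed curve would call for a more elaborate program.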


In the recognition of topological relations or metric relations
(A is larger than B, or A is above B), and in programs which call for
attention scanning, it would probably help considerably to introduce geometric
constraints into the S-A and A-A connections of the perceptron. In the models
which have been of primary interest up to this point, there is no way of telling,
apart from learned association, that activity of a particular A-unit refers to a
particular region of the sensory field. The A-unit space is non-topological in
character; it has no well-defined geometry or dimensionality. This means
that, apart from learning, there is no way of telling from observations on
the state of the A-units, what are the topological or geometrical properties
of the stimulus which is present on the retina. While it seems likely that a
geometrically constrained organization of A-unit connections (e.g.,
increasing the probability of interconnection between A-units whose retinal
fields lie in close proximity to one another) would be helpful, there is still no
indication of what are the best constraints, or what gain in performance
can actually be realized by such means.


23. SENSORY ANALYZING MECHANISMS


The term "sensory analyzing mechanism" will be used for any
signal transmission unit or network which detects and transmits information
about selected parts or features of a total stimulus pattern. Such mechanisms
can frequently be used to reduce the amount of information which the perceptron
must be prepared to evaluate. They are particularly useful in highly organized
environments (such as the familiar visual environment, or an environment of
printed words or spoken language) where purely random stimuli are unlikely
to occur or are of little interest. Thus a mechanism which detects boundaries
of a solid image or describes gradients and contrasts in the visual field, or
performs a Fourier analysis of an audio input, or which encodes speech into a
sequence of phonemes, would be considered a sensory analyzing mechanism.
A simple sensory unit which detects the level of illumination at a given point,
or an A-unit which samples the illumination over a selected set of points are
also sensory analyzing mechanisms.


In most models considered thus far, little attempt has been made
to optimize the sensory analyzing mechanisms employed. The random origin
configurations which have generally been employed can be shown to be far
from optimum. In this chapter, various methods of improving this primitive
organization will be considered, particularly with respect to visual and
auditory systems. For the most part, these mechanisms are assumed to take
the form of built-in constraints, such as were considered briefly in the d.i.d.
models of Section 7.2.2, and the similarity-constrained perceptrons of
Section 15.3. The existence of such mechanisms in biological organisms
is supported by an increasing amount of evidence, such as Lettvin's studies of
frog vision (Ref. 51), Sutherland's studies of octopus vision (Ref. 98),
Gibson and Walk on depth perception (Ref. 24), Sauer's work on bird navigation
(Ref. 90), and Hubel's work on cat vision (Ref. 113). Since most of these
mechanisms appear to be hereditary rather than learned, it seems likely
that they may be realized either by simple spatial constraints in the
distributions of connections in the sensory network, or else by simple
"typological constraints" governing the kinds of cells which may be
interconnected.


23.1 Visual Analyzing Mechanisms 


A number of basic strategies for processing visual information
have been proposed. Some of these are so closely tied to digital computer
processes that they are of little interest for a biological model, while others
require such a degree of logical precision and so large a system as to be
biologically implausible (e.g., Refs. 16, 17, 71). The techniques to be
considered here are grouped under four main headings: (1) Local property
detectors; (2) Hierarchical retinal field organizations; (3) Sequential
programs (centering and scanning methods); and (4) Sampling of sensory
parameters. The possible advantages of each of these methods will be
considered (largely in a qualitative fashion), and the problem of an optimum
mixture of analyzing mechanisms (somewhat analogous to the "mixed strategy"
problem in game theory) will be discussed.


23.1.1 Local Property Detectors 


The term "local property detector" will be used for any
mechanism or neuron which responds to some particular feature of the
stimulus pattern at a particular location (for example, brightness, color,
contour direction, etc.). Contour detectors and other types of property
detectors have been described by Culbertson (Ref. 17), Taylor (Ref. 99),
Inselberg, Löfgren, and von Foerster (Ref. 4), and others. Lettvin and
associates (Ref. 51) have described four mechanisms (for detection of
contrast, convexity, or small spot detection, moving edge detection, and
dimming detection) which appear to map into four distinct layers of the frog's
tectum. Of particular interest for present purposes is the series of
experiments described by Hubel (Ref. 113), in which the cells of a cat's visual
cortex are shown to respond to lines and bars in particular positions and
orientations, or to stimuli moving in particular directions.


The visual property detectors which appear on an a priori basis
to be of maximum value for pattern recognition in an ordinary terrestrial
environment (where the main purpose of the system is to detect and recognize
coherent physical objects) include the following:


1) Brightness and color detection and measurement
2) Contour and gradient detection
3) Curvilinearity detection and measurement
4) Detection of angles, intersections, and discontinuities
   of lines and boundaries
5) Spot detection
6) Sensing of textures, and measurement of texture
   gradients
7) Velocity and acceleration detection and measurement


In order to recognize stimulus patterns or objects, information
of the types listed above must somehow be combined for different parts of
the retina, to provide an indication of the total configuration. This has been
the main job of the association units, in the perceptrons considered thus far.
In all cases considered in previous chapters, the A-units have formed
combinatorial functions of information coming from "local intensity detectors"
(the S-units); thus the only property detectors employed have been of the first
type. The perceptron illustrated in Figure 65 introduces an additional layer
of A-units immediately following the S-units, which can detect additional
properties of the types indicated above. The A(2) layer, having its origin
points in the A(1) layer, now responds to combinations of local properties such
as lines and gradients, rather than merely to points of light.


[Figure 65 diagram; layer labels: S (POINT DETECTORS), A(1) (LOCAL PROPERTY
DETECTORS), A(2) (PROPERTY COMBINATION DETECTORS).]

Figure 65 ORGANIZATION OF A PERCEPTRON EMPLOYING LOCAL PROPERTY DETECTORS.


The organization of origin fields for A-units serving as property
detectors of various types is illustrated in Figure 66. The single-connection
"point detector" serves merely as a logical relay for information which
could be obtained equally well directly from the retina. The concentric
field organization of the spot detector appears to be found (in the case of
the cat) more characteristically in the retinal ganglion cells than in the
visual cortex (Ref. 113). The various forms of line detectors and the
"Type 2" termination detector have all been observed in the cat's cortex
by Hubel. Hubel has also reported units which respond only to moving
stimuli, although the organization appears to be different from that suggested
in Fig. 66(a), for the "moving edge detector". There is some evidence that
the movement detectors in the cat rely more upon the simultaneous summation
of "off" signals from uncovered retinal points and "on" signals from retinal
points which have just been covered by the displaced stimulus.


[Figure 66 diagram.
(a) ORGANIZATION OF SENSORY FIELDS OF A(1) UNITS. BROKEN LINES INDICATE FIELDS
OF INHIBITORY ORIGIN POINTS; SOLID LINES INDICATE EXCITATORY FIELDS. Detector
types shown: POINT DETECTOR; SPOT DETECTOR; LINE DETECTOR (LIGHT ON DARK
GROUND); TERMINATION OR CORNER DETECTOR (TYPE 2); BOUNDARY OR GRADIENT
DETECTOR; TERMINATED LINE DETECTOR (TYPE 1); CORNER DETECTOR (TYPE 1);
MOVING EDGE DETECTOR.
(b) TYPICAL A(2) COMBINATIONS. POSITION OF RETINAL FIELDS OF A(1) UNITS IS
SHOWN.]

Figure 66 ORIGIN FIELD ORGANIZATIONS FOR LOCAL PROPERTY DETECTORS


The use of the Type 2 termination detectors is illustrated in
Fig. 66(b). An A(2) unit which receives connections both from a termination
detector and a line detector crossing the same field can recognize that the
line approaches the inhibitory spot of the termination detector, but does not
cross it. The same termination detector, taken in conjunction with lines at
different angles, can serve to indicate termination of any one of the lines, so
that there is considerable saving by this method. In fact, if there are k
discriminable angles for straight lines, and r discriminable translates of
each line (so that there are about r² distinguishable termination-points
scattered over the retina), then a system which employs Type 1 termination
detectors would require a total of r²k A(1) units to guarantee a detector
for each combination of angle and termination point. The use of Type 2
detectors in conjunction with line detectors (as in Fig. 66(b)) would require
only r² + rk A(1) units, to convey the same information. If r and k
are both equal to 100, this means that 10⁶ A(1) units are required with
Type 1 units, and 2 x 10⁴ with Type 2 units. This may indicate why the
Type 2 configuration appears to be found in the cat, rather than the Type 1
configurations.
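

The unit counts quoted for the two schemes can be checked with a few lines of
arithmetic. The function names type1_units and type2_units are hypothetical;
the formulas are the ones given in the paragraph above (r²k for Type 1, and
r² termination detectors plus rk line detectors for Type 2).

```python
def type1_units(r, k):
    """One detector per combination of angle and termination point."""
    return r ** 2 * k


def type2_units(r, k):
    """r**2 termination detectors shared by all angles, plus r*k line detectors."""
    return r ** 2 + r * k


if __name__ == "__main__":
    r = k = 100
    print(type1_units(r, k), type2_units(r, k))
```

For r = k = 100 the Type 1 scheme needs fifty times as many units, and the
ratio grows roughly in proportion to k when r and k are comparable.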


Figure 66(b) also demonstrates the multiple use of the same
elementary property detectors (A(1) units) for a number of more complex
functions at the A(2) level. Thus, the same A(1) unit is employed both in a
terminated line detector and also as part of a moving line detector. Since
movement detection can thus be obtained quite economically at the A(2)
level, the type of moving edge detector illustrated in Figure 66(a) would
tend to be obviated. Hubel's observations on the cat suggest that (although
more complex organizations may remain to be discovered) the most
prominent types of property detectors in the visual cortex are of very simple
types, such as the line and boundary detectors and Type 2 termination detectors
illustrated in Figure 66(a). In all of these cases, a single excitatory and
inhibitory field, with simple constraints on the density of connections of
each type, is sufficient to yield the mechanism indicated.
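

How one excitatory and one inhibitory field can yield a property detector is
easy to show in a sketch. The code below assumes a binary retina represented
as the set of lit points, unit weights of +1 and -1, and 3 x 3 fields; the
names fires, spot_detector and horizontal_line_detector, the field shapes and
the thresholds are illustrative choices, not values taken from the text or
from Hubel's observations.

```python
def fires(lit, excitatory, inhibitory, threshold):
    """Threshold unit: +1 per lit excitatory point, -1 per lit inhibitory point."""
    total = len(lit & excitatory) - len(lit & inhibitory)
    return total >= threshold


def spot_detector(cx, cy):
    """Concentric field: excitatory center, inhibitory surround, threshold 1."""
    center = {(cx, cy)}
    block = {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    return center, block - center, 1


def horizontal_line_detector(cx, cy):
    """Excitatory row of three points, inhibitory rows above and below."""
    line = {(cx - 1, cy), (cx, cy), (cx + 1, cy)}
    flanks = {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 1)}
    return line, flanks, 3


if __name__ == "__main__":
    spot = {(5, 5)}
    h_line = {(x, 5) for x in range(10)}
    v_line = {(5, y) for y in range(10)}
    for name, stimulus in (("spot", spot), ("h_line", h_line), ("v_line", v_line)):
        print(name,
              fires(stimulus, *spot_detector(5, 5)),
              fires(stimulus, *horizontal_line_detector(5, 5)))
```

The same threshold rule serves both detectors; only the layout of the two
fields differs, in line with the remark that simple constraints on the
placement of each type of connection are enough.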


The actual advantages which might be realized by means of
various types of property detectors have been investigated for several
simple discrimination problems, with the results shown in Table 10. Two
types of environments were considered: the first consists of the letter "T"
in right-side-up and upside-down orientations, and the second consists of
the letter "L", also right-side-up and upside-down. Each letter can appear
in all translational positions. The problem of discriminating the right-side-up
"T" from the upside-down "T" is considered for a variety of retinal sizes
ranging from 20 x 20 to 1000 x 1000. The retina is assumed to be toroidally
connected in all cases. With both the T and the L, the horizontal line is
taken to be nine units long, while the height of the letter is ten units. The
thickness of the lines is one unit, throughout. The perceptrons considered
are of the type shown in Figure 65, with the assumption that all inputs to
A-units are excitatory. Rather than attempting to find optimum parameters
for the various types of property detectors, the number of A(2) inputs is
always the minimum number which will permit the discrimination to be
achieved. Other parameters (and the introduction of inhibitory connections)
would undoubtedly permit more economical solutions, but this serves to
illustrate basic principles.


The table gives the probabilities of finding A^(2) units which
will discriminate between a given stimulus of the "positive" class (say the
upright position) and all members of the opposite class. The origin points
of the A^(2) units are assumed to be chosen at random from among the
A^(1) units. The first line of the table, in which the A^(1) units are
simple point detectors, corresponds to the case of a simple perceptron,
where each A-unit receives its input connections directly from the retina.
For such a system, it can easily be seen that at least two excitatory origins
and a threshold of 2 are required in order to distinguish between the
upright and upside-down "L", while three excitatory origins and a threshold
of 3 are required to distinguish the upright from the upside-down "T".*
The figures in the first two columns of the table are influenced by
small-retina effects, which disappear for the 40 x 40 and larger retinas.
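
The minimum-input claim for point detectors can be checked by exhaustive search. The following is a minimal sketch, not taken from the text: it assumes that "upside-down" means a top-to-bottom reflection of the letter, that the T carries its stem under the center of the bar, and that an A-unit with k excitatory origins and a threshold of k is "useful" when all of its origins lie on the upright letter and no translate of the upside-down letter covers them all. The function names and the coordinate layout are illustrative only.

```python
from itertools import combinations


def letter_t():
    """Upright T: 9-unit bar on the top row, one-unit stem, total height 10."""
    return {(x, 0) for x in range(9)} | {(4, y) for y in range(1, 10)}


def letter_l():
    """Upright L: 10-unit vertical stroke with a 9-unit bar on the bottom row."""
    return {(0, y) for y in range(10)} | {(x, 9) for x in range(9)}


def upside_down(shape):
    """Reflect a 10-row letter top to bottom (assumed meaning of upside-down)."""
    return {(x, 9 - y) for x, y in shape}


def count_useful(positive, negative, side, k):
    """Count k-point origin sets on `positive` that no translate of `negative`
    covers completely, on a toroidal retina of `side` x `side` points."""
    overlaps = set()
    for dx in range(side):
        for dy in range(side):
            shifted = {((x + dx) % side, (y + dy) % side) for x, y in negative}
            common = frozenset(positive & shifted)
            if common:
                overlaps.add(common)
    useful = 0
    for origins in combinations(sorted(positive), k):
        chosen = frozenset(origins)
        # The unit is useful only if no upside-down letter lights every origin.
        if not any(chosen <= common for common in overlaps):
            useful += 1
    return useful


if __name__ == "__main__":
    for name, shape in (("T", letter_t()), ("L", letter_l())):
        for k in (1, 2, 3):
            print(name, k, count_useful(shape, upside_down(shape), 40, k))
```

Under these assumptions no pair of points separates the upright T from every upside-down T, because a half-turn of the T carries any pair of its points onto a pair with the same separation; three points are the least that will do. For the L, a suitably chosen pair already suffices.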


Several general conclusions can be drawn from this table.
First of all, it is clear that the value of different types of property detectors
depends upon the stimuli to be discriminated as well as the size of the retina.
For the discrimination of the L-shaped stimuli, which require only two points
or blobs for discrimination, the best results are obtained with large (4 x 4)
square origin point configurations for the A^(1) units, while for the T's
a slightly elongated (4 x 5) configuration with a high threshold is preferable,
since it permits the use of only two A^(1) units instead of three per A^(2)
unit. Note, however, that the advantage of the rectangular origin configuration
over the 4 x 4 square is pronounced only for large retinal sizes; for a
retina smaller than 20 x 20, the square configuration might actually be
preferable. For the conditions considered in this analysis, the following
equation for the probability of a useful A^(2) unit shows the effect of
increasing retinal size:

$$P = \frac{m}{(r N_s)^{x}} \tag{23.1}$$

where N_s = number of S-points in retina
      m = number of useful combinations of A^(1) origin configurations
          for an A^(2) unit
      r = number of admissible rotational positions for each A^(1)
          configuration
      x = number of input connections to each A^(2) unit


* The reader may find it instructive to examine the Q-matrices for a
binomial perceptron in these problems, and satisfy himself that they
are consistent with the geometrical requirement that three inputs and
a threshold of 3 are required to discriminate between the upright and
upside-down "T", in all translational positions.


For a large retina, P clearly becomes small very rapidly, and the situation
is aggravated by the requirement of many inputs for each A^(2) unit. Thus
for the discrimination of the upright and upside-down T, which requires
three point inputs, P goes from 10^-4 for a 20 x 20 retina to about
10^-[exponent illegible] for a 1000 x 1000 retina. The use of 4 x 5 bars as
line detectors instead of point configurations, while it improves the
probability by more than three orders of magnitude, still leaves a requirement
for over 10^12 A^(2) units if the T is to be discriminated reliably in the
large retina. Even with optimum parameters, the required number of A^(2)
units is inadmissibly large. Nonetheless, the recognition of the position of
a 9 x 10 "T" in a 1000 x 1000 field is certainly well within the limits of
human vision. Some additional means must therefore be found to provide an
economical solution for this problem without introducing a brainful of
special "T-detectors". The principles discussed in the following section,
combined with the use of property detectors, will be seen to yield a radical
improvement in the recognition of small stimuli.
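
The rate at which P collapses with retinal size can be illustrated numerically. The sketch below assumes that equation (23.1) has the form P = m / (r N_s)^x, which is the form implied jointly by equations (23.2) and (23.3); the function names are illustrative, and m is held fixed, so the small-retina effects mentioned for Table 10 are ignored.

```python
def p_useful(m, r, x, n_s):
    """Probability of a useful A^(2) unit, read as P = m / (r * N_s)**x."""
    return m / (r * n_s) ** x


def shrink_factor(x, small_side, large_side):
    """Factor by which P falls when a square retina grows from small_side
    to large_side on a side, with m, r and x held fixed (an assumption)."""
    return (large_side ** 2 / small_side ** 2) ** x


if __name__ == "__main__":
    # Going from a 20 x 20 to a 1000 x 1000 retina multiplies N_s by 2500.
    for x in (2, 3, 4):
        print(f"x = {x}: P falls by a factor of {shrink_factor(x, 20, 1000):.3g}")
```

With three inputs per unit the factor is 2500^3, or roughly ten orders of magnitude, which is why each additional required input is so costly in a large retina.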

23.1.2 Hierarchical Retinal Field Organizations


The "retinal field" of an A-unit is the region of the retina in
which its origin points may be found. In a multi-layer system, the retinal
field of an A^(2) unit is the union of the retinal fields of the A^(1) units
which are connected to the A^(2) unit; in general, the retinal field of an
A^(n) unit is the union of the retinal fields of the connected A^(n-1) units. In
a perceptron with a hierarchical retinal field organization, the retinal fields
of the A-units tend to increase in area, the greater the logical distance of the
A-unit from the retina. For example, the A^(1) units may have highly localized
origin configurations for the detection of local properties (as in Table 10);
the A^(2) units could then detect combinations of properties over a somewhat
larger field (responding to small, compact figures or parts of larger patterns);
and a layer of A^(3) units might then be added to respond to combinations of
sub-figures over the entire retina. While the general principle of organization
is from small to large retinal fields as the A-units increase in depth, it is not
required that all A-units at a given level have retinal fields of the same size;
there may be A^(1) units, for example, whose fields are larger than the
smallest A^(2) fields, provided the expected size of the retinal fields
increases with increasing depth.


Such a system is clearly much closer to the organization of
the mammalian visual system than the uniform origin distributions which
were considered in previous models. A brief consideration was given to
constrained origin fields in Section 7.2.2, where it was found that no
appreciable gain in performance was obtained with large stimuli, such as
the squares and triangles of Experiment 7. The effects of employing
constrained retinal fields for the A^(2) units in the perceptron of Figure 65
will now be considered, for the range of retinal sizes shown in Table 10.
It was found in the preceding section that as the retina becomes large
relative to the size of the stimuli, the probability of finding a useful A^(2)
unit becomes inadmissibly small in the unconstrained system. Table 11
shows the effect of limiting A^(2) retinal fields to a 20 x 20 region of the
retina (located at random in a larger retina). Again, it should be remembered
that the parameters have not been optimized, and that appreciably better
results might be obtained with larger numbers of inputs to the A^(2) units,
and the inclusion of inhibitory connections. Nonetheless, a comparison
with Table 10 illustrates the marked reduction in the size of the system
necessary to achieve recognition in a large retina. The first column of
probabilities (for the 20 x 20 retina) is, of course, identical to the
corresponding column of Table 10, and the first line corresponds to a
three-layer model with constrained origin fields for the A-units. In the case
of the 1000 x 1000 retina, using the best of the A^(1) origin configurations, a
gain of more than five orders of magnitude is obtained, bringing the
discrimination problem for the first time within the capacity of a human-sized
brain model. Note, however, that the best A^(1) origin configuration has
shifted from the 4 x 5 bar with θ = 5 to the 4 x 4 square with θ = 1.


The probability P' of finding a useful A^(2) unit in this system
is given by the following equation, which is analogous to (23.1):

$$P' = \frac{m}{(r N_f)^{x}} \cdot \frac{N_f}{N_s} = \frac{m}{r^{x} N_f^{\,x-1} N_s} \tag{23.2}$$

where m, r, and x are defined as for equation (23.1), N_s = number of
S-points in the retina, and N_f = number of S-points in the retinal field
of an A^(2) unit. Taking the ratio of equations (23.2) and (23.1), we obtain
the relative advantage of the constrained retinal field system over the
unconstrained system:

$$\frac{P'}{P} = \left( \frac{N_s}{N_f} \right)^{x-1} \tag{23.3}$$

Thus the advantage increases exponentially with the number of connections
required to each A^(2) unit, and with the ratio N_s/N_f. Both of these
effects can be seen in Table 11.
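
The size of this advantage is easily tabulated. The following sketch assumes the readings P = m / (r N_s)^x for the unconstrained system and P' = m / (r^x N_f^(x-1) N_s) for the constrained one; the subscript f for the retinal field, the function names, and the sample values of m and r are illustrative assumptions rather than entries from Table 11.

```python
from math import isclose


def p_unconstrained(m, r, x, n_s):
    """Unconstrained system: origins drawn from the whole retina."""
    return m / (r * n_s) ** x


def p_constrained(m, r, x, n_s, n_f):
    """Constrained system: origins drawn from a retinal field of n_f points."""
    return m / (r ** x * n_f ** (x - 1) * n_s)


def advantage(x, n_s, n_f):
    """Ratio P'/P, which depends only on x and on N_s / N_f."""
    return (n_s / n_f) ** (x - 1)


if __name__ == "__main__":
    n_s, n_f = 1000 * 1000, 20 * 20  # 1000 x 1000 retina, 20 x 20 fields
    for x in (2, 3, 4):
        print(f"x = {x}: advantage = {advantage(x, n_s, n_f):.3g}")
    # The ratio of the two probabilities agrees with the closed form.
    ratio = p_constrained(1, 4, 3, n_s, n_f) / p_unconstrained(1, 4, 3, n_s)
    print(isclose(ratio, advantage(3, n_s, n_f)))
```

For the 1000 x 1000 retina with 20 x 20 fields the ratio N_s/N_f is 2500, so each additional required connection multiplies the advantage by a further factor of 2500.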


Clearly, if the system is required to recognize a stimulus of
diameter D, the size of the retinal field cannot be taken smaller than D
without loss of performance; the above equations assume that the retinal
field is large enough so that boundary effects can be neglected. The optimum
size, then, appears to be on the order of D, the expected stimulus diameter.
We now have the problem of how to deal with universes of stimuli which vary
in diameter from very small to very large patterns. The best choice of a
distribution of retinal field sizes for the A^(2) units will generally be one
which guarantees the same likelihood of finding a useful A^(2) unit for all
stimuli. For the particular case in which the stimulus diameter distribution
is uniform between the limits D_min and D_max, this can be approximately
realized by taking

$$\mathrm{Prob}(A_f = D^2) = \frac{1}{D^2 \alpha} \tag{23.4}$$

where A_f = fraction of retinal area in an A^(2) retinal field
      $\alpha = \sum_{D_{\min}}^{D_{\max}} \frac{1}{D^2}$ (where D is measured in retinal diameters)

Table 11 suggests that stimuli of the complexity of alphabetic
characters ranging in size from 0.01 to 1 retinal diameter can be recognized
by a system the size of the human brain (10^10 units) by employing a
four-layer model, with a suitable combination of property detector
configurations and a suitable distribution of A^(2) field diameters. The
recognition problem can be made considerably more difficult, however, by
adding additional degrees of freedom to the stimulus organizations. Consider,
for example, the following environment: Let W consist of two classes of
composite stimuli. Each stimulus consists of two 9 x 10 T's, which may be
located at any position in the retinal field, provided they are at least 10
retinal units apart. If both T's are right-side-up or if both are upside-down,
the stimulus is a member of the positive class; if one is right-side-up and
the other is upside-down, the stimulus is in the negative class. Let us
consider the probability of finding a useful A^(2) unit for this dichotomy.

If these stimuli are to be differentiated by A-units with random-point
origin configurations (all excitatory, as in the previous examples) then
six connections and a θ of 6 are required for each A^(2) unit. By employing
one of the line-detector mechanisms of Table 10, 4 inputs and a θ of 4 are
required. The constrained-field system of Table 11 (with 20 x 20 retinal fields
for the A^(2) units) cannot be employed here, as the combined stimulus
pattern may cover the entire retinal field. The best that can be done is to
employ the A^(1) configuration of 4 x 5 bars, which yields a probability of
6 x 10^-[exponent illegible] of finding a useful A^(2) unit, with a
1000 x 1000 retina. (For the single random point configuration -- the worst
case -- the probability is 7.34 x 10^-[exponent illegible].)


By employing a five-layer topology, it is possible to take
advantage of the fact that each stimulus actually consists of two organized
sub-patterns, each having quite small dimensions relative to the retina.
Assume the A^(2) units to have 20 x 20 retinal fields, as in Table 11, while
the A^(3) units have two excitatory input connections, chosen at random
from among the A^(2) units. Thus the A^(1) units serve as local property
detectors, the A^(2) units serve as sub-pattern detectors, and the A^(3)
units integrate this information over the whole retinal field. (In this
particular problem, the performance could be improved further by taking a
larger number of input connections for each A^(3) unit, but as before, we are
trying to demonstrate basic principles rather than find optimum organizations.)
This five-layer system is compared with the four-layer system in Table 12.
For moderate numbers of connections to the A^(3) units in this system, the
probability of a useful A^(3) unit (with θ = 2) can be closely approximated by
the binomial probability:

$$P'' = \binom{x^{(3)}}{2} P'^{\,2} \, (1 - P')^{x^{(3)} - 2} \tag{23.5}$$

where P' = probability of a useful A^(2) unit for "sub-figure"
           discrimination, and
      x^(3) = number of (excitatory) input connections to an A^(3) unit

Thus with 25 inputs to each A^(3) unit the probabilities for the five-layer
systems could be increased by a factor of about 300 (the binomial coefficient
rises from 1 for two inputs to 300 for 25). Note that even under
these conditions, however, while the problem becomes soluble for a
brain-sized system in the case of a 100 by 100 retina, it is still
unmanageable in the 1000 x 1000 retina.
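
The factor of about 300 follows directly from the binomial coefficient. The sketch below evaluates the binomial probability of exactly two useful inputs among x^(3), which is the reading of equation (23.5) adopted here; the function name and the sample value of P' are illustrative assumptions, not figures from Table 12.

```python
from math import comb


def p_five_layer(p_sub, x3):
    """Probability that exactly two of the x3 inputs to an A^(3) unit come
    from useful A^(2) units, each useful with probability p_sub."""
    return comb(x3, 2) * p_sub ** 2 * (1 - p_sub) ** (x3 - 2)


if __name__ == "__main__":
    p_sub = 1e-6  # hypothetical sub-figure probability, for illustration only
    base = p_five_layer(p_sub, 2)
    for x3 in (2, 5, 10, 25):
        print(x3, comb(x3, 2), p_five_layer(p_sub, x3) / base)
```

So long as P' is small, the factor (1 - P') raised to the power x^(3) - 2 is close to one, and the gain over the two-input case is essentially the number of ways of choosing two inputs out of x^(3).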


The difficulty of this problem for the large retina should not
surprise us; it is unlikely that a human subject, asked to perform the
indicated discrimination with tachistoscopically presented stimuli, could do
appreciably better than chance, where the two T's each subtend only 1/100
of the central visual field, and are located at random relative to one another.
Even the case of the 100 x 100 retina (where the T's subtend 1/10 the
diameter of the field) would probably yield marginal results, if the subject
were not permitted time to scan the field or shift his attention during the
exposure. On the other hand, if the T's were constrained to lie relatively
close to one another (say within a 40 x 40 subfield) the problem would
probably not be difficult. This constrained problem could likewise be handled
readily by a five-layer perceptron in which the A^(3) retinal fields were
constrained to a 40 x 40 region, while limiting the A^(2) fields to 20 x 20,
as before. Thus it appears that a hierarchical organization with three
association layers is competitive with human visual performance, with respect
to resolution of detailed figures, and recognition of complexes of sub-figures,
under conditions in which no scanning or shifting of attention is allowed.

If we were to complicate the problem by adding a third "T",
again placing the stimuli in the positive class if all T's face the same way,
and the negative class if some face up and some down, the probabilities of
finding suitable A^(2) and A^(3) units would again fall by many orders of
magnitude. For this problem, it is unlikely that any purely spatial and
parametric constraints on the network would permit a solution with only 10^10
units, with a retina appreciably greater than the size of the stimuli. It is
also unlikely that a human subject, under tachistoscopic conditions, could do
much better. Thus for complex organizations of organized sub-figures, each
of which has several degrees of freedom independently of the others, some
additional strategy must be sought to improve recognition capability. The use
of sequential observations seems to be indicated at this point.


23.1.3 Sequential Observation Programs 


The perceptrons considered in the last two sections, while
facilitating the discrimination of small patterns in which fine details provide
the essential information, are still far from optimum. For one thing, the
number of A-units required remains very large; for another thing, the
learning time would be correspondingly great, if the discrimination must be
learned for all combinations of figural elements. These difficulties can be
drastically reduced by the employment of a program-learning perceptron, such
as the models considered in the last chapter. In particular, a system of the
type described in Section 22.4, with a selective attention mechanism which
permits it to attend to one detail or sub-figure at a time, is likely to prove
useful in dealing with complex stimuli. Such a system can be employed in
at least two basic ways:

1) It can be taught to recognize the presence of a sub-pattern
(a spot or region in which the fine structure is particularly dense) without
having to classify it or differentiate it precisely. It can then direct the visual
centering mechanisms to bring this pattern to the center of the retina, where
high resolution is possible, and where the system is taught to differentiate
the type of pattern more precisely.

2) The perceptron may be taught to examine each of a number
of retinal regions in turn (either by a systematic scanning procedure, by
following boundaries, or by directing attention to those sub-fields in which
the fine structure is particularly dense). This will result in the recognition
of a definite sequence of details, which, in its entirety, serves to identify
the complex stimulus organization.

The recognition of small objects in a large field may best be
achieved by the first of these methods, while the discrimination of complex
organizations (e.g., individual faces) requires the second method. In
employing the second method, it would be particularly helpful if the
perceptron could shift its field of attention systematically in a given
direction, with the direction of attention shift provided as an additional
piece of information to the association system at all times. In this case,
the general configuration of the letter "A" followed by the letter "B"
followed by "C" could be recognized by starting from the left of the field,
shifting attention right to the first "detail", then right again to the second
detail, and then right again to the third. The recognition of this complete
sequence would indicate the ABC configuration regardless of the actual
positions of the letters in the field or their relative distances. It seems
likely that the general problem of relation-recognition will ultimately yield
only to sequential programs of this type.*
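
The position-independence of such a program can be shown in a few lines. The sketch below is a deliberately bare abstraction: it assumes that each detail has already been located and named by some other mechanism, and it represents repeated rightward shifts of attention simply by ordering the details on their horizontal coordinate. The names `detail_sequence` and `is_abc` are illustrative.

```python
def detail_sequence(details):
    """Return the labels of (x_position, label) details in the order in which
    repeated rightward shifts of attention would encounter them."""
    return tuple(label for _, label in sorted(details))


def is_abc(details):
    """True when the details, read from left to right, spell A, B, C."""
    return detail_sequence(details) == ("A", "B", "C")


if __name__ == "__main__":
    # The same configuration at very different positions and spacings.
    print(is_abc([(3, "A"), (250, "B"), (700, "C")]))
    print(is_abc([(510, "A"), (515, "B"), (990, "C")]))
    # A different relation among the same three letters.
    print(is_abc([(10, "B"), (20, "A"), (30, "C")]))
```

Only the order of the recognized details enters the decision, so the outcome is unaffected by where the letters fall or how far apart they are.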


23.1.4 Sampling of Sensory Parameters 


A fourth basic strategy for simplifying the sensory data which
the perceptron must deal with is that of independent sampling of sensory
parameters. In a general visual input system, five parameters are of
interest: the intensity of illumination at a point, the frequency or color of
the illumination, the time at which it occurs, and the x and y coordinates
of the location of the point on the retinal surface. Each of these variables
may be varied independently of the others. If we required a retina of 1000
lines resolution (i.e., 10^6 points), with sensitivity to 10 frequency bands,
10 levels of illumination, and 10 time delays for the outputs of each S-point,
a total of 10^9 retinal points would be required to provide a sensory unit for
each combination of values.
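
The count for the complete system is simply the product of the numbers of levels in each parameter. A minimal sketch of the arithmetic follows; the dictionary keys and the function name are illustrative labels, not terms from the text.

```python
from math import prod


def complete_lattice_size(levels):
    """Sensory units needed to give every combination of values its own unit."""
    return prod(levels.values())


# Levels per parameter for the visual example: a 1000-line retina with
# 10 frequency bands, 10 levels of illumination and 10 time delays.
VISUAL_LEVELS = {"x": 1000, "y": 1000, "frequency": 10, "intensity": 10, "delay": 10}

if __name__ == "__main__":
    print(f"{complete_lattice_size(VISUAL_LEVELS):.0e}")
```

Each further parameter, or each finer division of an existing one, multiplies the requirement, which is what makes the complete lattice so expensive.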


If it is actually required to discriminate between any two patterns, 
no matter how minute the difference between them, then there is no way of 
escaping this requirement. In general, however, we are satisfied with 
approximate information, and it is only under special conditions of ''good 
observation" that we expect to obtain the highest resolution from the system. 


We can take advantage of this by means of the following organization. 


* One sequential mechanism which may greatly improve performance is to take
a sequence of "looks" at a given stimulus, with different fixation points
selected at random, accepting a majority decision for the final response. The
gains which might be expected, assuming independence between "looks", have
been discussed in Reference 79, pp. 156-157.

Suppose we limit the number of retinal points to 10^6. To each
of these S-points, x and y coordinates are assigned at random (from a
uniform distribution over the whole field, rather than just points on a 1000 by
1000 lattice). In addition, a frequency drawn at random from the sensitivity
range of the system is assigned to each S-point, and a threshold and time
delay are similarly assigned at random. Now, if the perceptron sees a
moving figure, with a variety of shading and color variation, it will be less
precise in its judgement as to the exact position of the figure at time t, or
the color of a given point in the retinal field at time t, than would be the
case with the "complete" system with 10^9 S-points. If, however, we "fix"
the position of the figure on the retina, and provide maximum contrast
between illuminated and non-illuminated points (i.e., sharpen the figure to
a black and white silhouette), and observe it for long enough to permit all
time delays to propagate, then we have just as good shape-definition as in the
system with 10^9 S-points, since all 10^6 retinal points will contribute one
bit of information. Alternatively, if the entire field is illuminated at maximum
intensity with a given frequency of light, this frequency can be discriminated
to one part in 10^6, or five orders of magnitude better than the previous
model. The same will be true with respect to intensity discrimination if
the field is illuminated with white light, all frequency components being
present with the same intensity. Similarly, the velocity, acceleration,
and higher derivatives of the velocity of a moving object can be
discriminated much better with the 10^6 element randomized-parameter system,
provided the moving image consists of a sharp black and white pattern.
Finally, we note that if we wish to specify the exact retinal coordinates of
a square, the edges of which are aligned with the lattice points in the first
model, we can expect a maximum accuracy of one part in 1000, whereas
with the random configuration (where some of the points will fall virtually
on the boundary of the square regardless of its location) we could expect
to improve the performance by several orders of magnitude.
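
The last point lends itself to a small numerical experiment. The sketch below reduces the square to a single vertical edge of a sharp silhouette and compares a k x k lattice with the same number of points scattered at random; the edge is estimated as the midpoint between the rightmost illuminated point and the leftmost dark one. The reduction to one edge, the estimator, and the function names are all simplifying assumptions made for illustration, and k is kept small so that the example runs quickly.

```python
import random


def edge_error(xs, edge):
    """Locate a vertical edge at x = edge from points that are lit when
    x < edge, and return the absolute error of the midpoint estimate."""
    lit = [x for x in xs if x < edge]
    dark = [x for x in xs if x >= edge]
    estimate = (max(lit) + min(dark)) / 2
    return abs(estimate - edge)


def lattice_xs(k):
    """x coordinates of a k x k lattice: k*k points but only k distinct values."""
    return [(i + 0.5) / k for i in range(k) for _ in range(k)]


def random_xs(k, rng):
    """x coordinates of k*k points placed uniformly at random."""
    return [rng.random() for _ in range(k * k)]


if __name__ == "__main__":
    k, edge = 100, 1 / 3
    rng = random.Random(0)
    print("lattice error:", edge_error(lattice_xs(k), edge))
    print("random error: ", edge_error(random_xs(k, rng), edge))
```

The lattice can do no better than about half a lattice spacing, since all points in a column share one x value, whereas the randomly placed points offer k*k distinct x values and so bracket the edge far more closely.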


What is sacrificed in this system is the ability to provide full
information about individual retinal points, and the ability to provide maximum
precision of discrimination in the case of shaded, moving figures. It would
be difficult, for example, to precisely locate the boundary of a moving cloud,
or to state the exact colors of specified points in a continuously varying
mixture of colored lights; these are precisely the conditions, however,
under which a human observer would also encounter difficulty, whereas if
we optimize the conditions of observation by providing stationary figures and
sharp contrast, resolution far in excess of the "fixed lattice system" can be
obtained. Note that there is a trade-off between the resolution obtainable in
one parameter and the resolution in other parameters; we cannot simultaneously
optimize conditions for observing position and velocity, or color and intensity,
for example. An interesting analogy can be drawn to the limitations on
simultaneous observation of related variables in quantum mechanics, although
there is no reason to suppose that the analogy is anything other than
coincidental.

23.1.5 "Mixed Strategies" and the Design of General Purpose Systems


In the preceding sections, it has been demonstrated that the kind
of network organization which is best suited for one stimulus environment
or discrimination problem may be far from optimum for a different problem.
The upright and upside-down T's, for example, might best be discriminated
by a specially designed T-detector; but in this case every other letter, or
combination of lines which might be encountered, would have to have its own
special detector mechanism, and the system would be useless in a general
environment. Thus the question arises, if we know only the general character
of an environment, but cannot anticipate all discriminations that the perceptron
may be required to learn, what is the best combination of stimulus analyzing
mechanisms to provide a good "general purpose" system?


This problem (on which no real analysis has been done to date)
seems to be related, at least superficially, to the mixed strategy problem
in game theory. The object of the game is to minimize the probability that
any discrimination problem likely to arise in nature will be insoluble, subject
to constraints on the size of the system, admissible learning times, etc. In
Equation (23.4) a proposed solution for the distribution of A^(2) fields was
presented, for the special case in which the stimulus diameters are uniformly
distributed. A more general solution should also consider the best mixture
of line-detectors, spot-detectors, point-combination detectors, etc., among
the A^(1) units, the number of layers to be employed and the distribution of
retinal fields among them, etc.

A few general rules seem to have emerged from studies thus
far. For one thing, it seems to be inadvisable to seek highly specialized
property detectors in the early stages of the network. A few basic types,
such as line and boundary detectors, spot detectors, termination detectors,
and movement detectors are certainly helpful, and yield appreciable
improvements over random-point combinations. But higher-level organizations
seem to be achieved better either by a mixture of simple properties at a
greater logical depth in the network (as in the five-layer system considered
in Section 23.2.2) or else by learning, at the R-unit level. For another
thing, the extension in depth of a hierarchical retinal field system is useful
for a limited number of levels, but extension much beyond three association
layers seems unlikely to improve capabilities appreciably in systems the size
of the human brain. Recognition problems which cannot be dealt with by a
five-layer hierarchical structure, due to the large number of small details
which must be considered in solving the problem, are best handled by a
sequential system, rather than by continuing to increase the depth of the
network.


It is questionable whether analytic procedures will be able to
make much headway in dealing with this problem, although a combined attack
with simulation techniques and analysis wherever applicable should yield
considerably better information concerning the optimum organization for a
given visual universe.

23.2 Audio-Analyzing Mechanisms


The sensory analyzing mechanisms which are best suited to 
an auditory input system are in some respects similar to those which have 
been considered for visual inputs. The difference in character of typical 
auditory patterns (speech in particular), where temporal organization largely 
takes the place of spatial organization, leads to a number of distinctive 
requirements. The following sections consider several of these special
problems.


23.2.1 Fourier Analysis and Parameter Sampling


In principle, a number of possible sensory representations 
could be used for auditory material, including the continuous measurement 
of the amplitude of a waveform; spectral analysis, with the amplitudes given 
for all frequency components as a function of time; and various ''reduced 
information" systems, such as the indication of zero-crossings, or the 
outputs of special filter systems. In the human auditory system, phase 
information appears to be disregarded, and a Fourier analysis into spectral 
components is employed. In a system designed to simulate human perform- 
ance in speech recognition, musical recognition, and related problems, a 
presentation of the actual waveform would burden the system with a great 
deal of excessive information. The same word spoken with slightly different 
phase relations between the frequency components, for example, would 
present completely different wave-shapes, which the perceptron would 
have to learn to identify. Thus the spectral analysis of the audio input
seems preferable.

With a Fourier analyzed input, the important sensory parameters
to be represented by an S-point are the frequency, amplitude (or threshold),
and time relative to the present (generally represented by connection delays).
With these three variables, the principle of independent sampling of sensory
parameters, discussed in Section 23.1.4, is again applicable. If the system
is required to discriminate 100 frequencies, 100 time delays, and 100
amplitudes, for example, then a total of 10^6 frequency-threshold-delay
combinations would be required with a "complete lattice" system. Using
independently sampled parameters, on the other hand, a system with only
1000 S-units could discriminate 1000 frequencies in an intense sustained
tone or mixture; it could discriminate 1000 amplitude levels in a "white
noise" mixture sustained for the duration of the maximum time delays; or
it could place the occurrence of an intense "pip" of white noise to a
precision of one part in 1000 in time. Under less optimum conditions, the
accuracy of discrimination in separate dimensions would be reduced, but
the composite organization could still be discriminated readily from an
appreciably different organization.


23.2.2 A Phoneme-Analyzing Perceptron


An introductory discussion of the phenomena of speech
perception can be found in the chapters by Licklider and Miller in Ref. 112.
Perceptrons for speech recognition and the association of names with
objects or events have been discussed in Section 21.3. In these systems,
it is assumed that a complete word must be learned as a primitive pattern,
without preliminary analysis into significant sounds, or phonemes. In
this section, a more sophisticated perceptron, capable of phonemic analysis,
will be described.

The possible improvement in efficiency which can be obtained
by analyzing a word into a sequence of phonemes can be highly significant.
If we consider a hypothetical (and rather unnatural) language in which there
are 100 allophones (or functionally equivalent sounds) for each phoneme, and
a word of five phonemes consists of an independent choice of one of the
allophones for each phoneme, then the word may appear in any one of
100^5 = 10^10 possible forms. For a perceptron with a high degree of
sensitivity to differences in sound patterns, this would mean that the
discrimination of two words would require an enormous number of
utterances (perhaps many millions) in order to generalize to all equivalent
pronunciations (allomorphs) which might occur. (In actuality, the
correlation between choices of allophones for different phonemes, in
ordinary speech, would greatly reduce the sample size required, but the
example will serve for illustrative purposes.) On the other hand, if each
phoneme were first recognized by a distinct R-unit, and the outputs of the
R-units taken as the input for a word-recognizing perceptron, this second
perceptron would receive an invariant sequence for each word, and in
principle a single utterance of each word (morpheme) would be sufficient for
complete generalization. The phoneme-recognizing units would each have
to distinguish a set of 100 allophones from a universe of 500 (assuming that
only five phonemes are involved), so that the learning at this level might be
achieved quite readily.
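
The economy of the two-stage arrangement can be made concrete with a toy version of the hypothetical language. In the sketch below the phoneme-recognizing stage is idealized as a lookup table that has already been learned perfectly; the numbering of allophones, the function names, and the sample word are assumptions introduced for illustration only.

```python
import random

N_PHONEMES = 5      # phonemes in the hypothetical language
N_ALLOPHONES = 100  # functionally equivalent sounds per phoneme


def allophone_table():
    """Idealized phoneme stage: map each of the 500 allophone numbers to its
    phoneme (allophones 0-99 belong to phoneme 0, 100-199 to phoneme 1, ...)."""
    return {a: a // N_ALLOPHONES for a in range(N_PHONEMES * N_ALLOPHONES)}


def utter(word, rng):
    """One pronunciation of a word: an independent choice of allophone for
    each of its phonemes."""
    return [p * N_ALLOPHONES + rng.randrange(N_ALLOPHONES) for p in word]


def phoneme_code(utterance, table):
    """The invariant sequence handed on to the word-recognizing stage."""
    return tuple(table[a] for a in utterance)


if __name__ == "__main__":
    rng = random.Random(1)
    table = allophone_table()
    word = (3, 0, 4, 1, 2)
    print("surface forms of one word:", N_ALLOPHONES ** len(word))
    codes = {phoneme_code(utter(word, rng), table) for _ in range(1000)}
    print("distinct codes reaching the word stage:", len(codes))
```

The first stage has only 500 sounds to sort into five classes, while the second stage sees a single code per word, however many of the 10^10 surface forms are actually uttered.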


In actuality, the recognition of a phoneme is not as simple
as the above discussion suggests, since a single speech sound cannot, in
general, be recognized independently of its context. The preceding and
subsequent sounds may completely alter the sound of a vowel, for example.
Thus a phoneme-recognizing perceptron must itself be a sequence-recognition
device, rather than a momentary-pattern recognizer.

A perceptron which appears to be capable of analyzing a
sequence of words, so as to spontaneously develop an internal code for the
phonemes employed, is illustrated in Figure 67. It is a five-layer perceptron,
with variable connections between the $A^{(1)}$ and $A^{(2)}$ layers, and between
the $A^{(3)}$ layer and the R-units. The $A^{(2)}$ layer can be thought of as
playing the role of "R-units" for the first three layers of the perceptron,
and will eventually learn the phoneme code to be employed. At the same
time, it serves as the "sensory system" for the last three layers, which
act as a three-layer perceptron for word-recognition. The $A^{(3)}$ system
may either be organized as a cross-coupled system, or its input connections
may be given a spectrum of delays; in either case, it is capable of
recognizing sequences of inputs, rather than just momentary patterns. If
the $A^{(3)}$ units are cross-coupled (particularly with inhibitory connections)
and are of the "flip-flop" variety, tending to remain in their present "on" or
"off" state until receiving a super-threshold signal, then the $A^{(3)}$ system
will tend to go to a state characteristic of the sequence of input patterns
regardless of the duration of the individual patterns in the sequence. This
is particularly true if the $A^{(2)}$ system goes through a sequence of states
(A, B, C, ...) where each state is "held" without variation for a time
greater than the "settling-down time" of the $A^{(3)}$ system (which should
normally be no greater than two or three transmission delays, for the
conditions given). Thus a "word" encoded into a sequence of phonemes
by the $A^{(2)}$ units would lead to a fixed state of the $A^{(3)}$ system upon its
termination, regardless of the actual duration of the phonemes.*


* This effect, as well as some of the others discussed in this section,
might be employed to advantage in a visual system which is required
to recognize sequences of stimuli, such as successively presented
letters or signals.

The reinforcement rule for the $A^{(3)}$ to R-unit connections is a
conventional α-system rule, so that an error correction procedure may
be employed to teach the system to recognize words. The reinforcement
rule for the $A^{(1)}$ to $A^{(2)}$ connections, however, is a probabilistic one,
defined as follows:


1. With each connection, $c_{ij}$, from an $A^{(1)}$ to an $A^{(2)}$ unit
is associated a time-dependent probability, $P_{ij}(t)$, called the instability
coefficient of the connection.

2. Reinforcement at the preterminal level ($A^{(1)}$ to $A^{(2)}$
network) is applied only upon the decision of the reinforcement control system,
or experimenter. Otherwise, the values of these connections remain
unchanged.

3. If preterminal reinforcement is applied at time $t$, all
instability coefficients are changed by the amount
$\Delta P_{ij} = a_i \epsilon - \delta P_{ij}(t)$, $[0 < \epsilon < 1]$.
If no reinforcement is applied at time $t$, $\Delta P_{ij} = -\delta P_{ij}(t)$.


4. If reinforcement is applied, assume that the current
activity states of all $A^{(2)}$ units are "wrong", and apply an error
correction, $\Delta v_{ij}$, to each connection with probability $P_{ij}(t)$.
(This is equivalent to an α-system error correction applied probabilistically.)
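
The four rules above can be sketched in code. What follows is a minimal
illustration in Python, not the author's procedure: it assumes that the
quantity multiplying ε in rule 3 is the 0/1 activity of the originating
$A^{(1)}$ unit, that the correction of rule 4 moves a connection from an active
$A^{(1)}$ unit by one step against the current state of the $A^{(2)}$ unit, and
that the coefficients are clamped to the interval from 0 to 1. The function
names, the list-of-lists layout, and the `step` parameter are all hypothetical.

```python
import random


def update_instability(P, active_a1, reinforced, epsilon, delta):
    """Rule 3: grow coefficients on reinforcement, decay them always.

    P[i][j] is the instability coefficient of the connection from A(1)
    unit i to A(2) unit j; active_a1[i] is 1 if unit i is active, else 0.
    """
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            growth = active_a1[i] * epsilon if reinforced else 0.0
            # Clamp so the coefficient stays a valid probability
            # (the clamp is an assumption of this sketch).
            row[j] = min(1.0, max(0.0, p + growth - delta * p))


def probabilistic_correction(v, P, active_a1, active_a2, rng, step=1.0):
    """Rule 4: treat every A(2) state as wrong and correct each
    connection with probability P[i][j].

    Assumed form of the correction: connections from active A(1) units
    are moved by `step` against the current state of the A(2) unit.
    """
    for i, row in enumerate(v):
        if not active_a1[i]:
            continue
        for j in range(len(row)):
            if rng.random() < P[i][j]:
                row[j] += -step if active_a2[j] else step


# Small demonstration with two A(1) units and two A(2) units.
P_demo = [[0.0, 0.0], [0.0, 0.0]]
update_instability(P_demo, active_a1=[1, 0], reinforced=True,
                   epsilon=0.5, delta=0.1)
print(P_demo)  # [[0.5, 0.5], [0.0, 0.0]]
```

In this sketch only connections leaving an active $A^{(1)}$ unit gain
instability, and every coefficient decays when no reinforcement is applied.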


The actual training procedure can best be described in terms
of the following experiment:

Assume a language, L, possessing three phonemes, A, B, and
C, with k allophones of each phoneme. Time is quantized in units $\Delta t$.
Each phoneme persists for a duration $\Delta t$, unless otherwise indicated. Let
L consist of the six words, AB, BA, AC, CA, BC, CB. Assume some
output code, $R(w)$, is assigned to each word, $w$. Then the procedure for
training the perceptron is as follows:


Present a randomly chosen allomorph of the first word (AB), and
observe the response of the perceptron. If this is correct, go on to the next
word (BA); if it is incorrect, present AB again, this time applying a (quantized)
error-correction reinforcement to the terminal connections ($A^{(3)}$ to R-units).
Again test the response to the word AB. If the response is now correct, go
on to the next word; otherwise, present the word again, this time reinforcing
the preterminal network ($A^{(1)}$ to $A^{(2)}$ connections) and leaving the
terminal network unaltered. Then apply a second correction to the terminal
network, and retest the response to AB. Continue alternating between
reinforcements applied to the terminal network and reinforcements applied
to the preterminal network, until AB elicits the correct response. Then go
on to the next word (BA) and repeat the same procedure. Continue cycling
through the complete vocabulary until all words have been learned correctly.
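
The control flow of this procedure can be sketched as follows. This is an
illustrative outline in Python rather than a simulation of the perceptron:
the three callables are hypothetical hooks standing in for the network, and
the limits `max_cycles` and `max_attempts` are safeguards added for this
sketch, not part of the procedure described above.

```python
def train_vocabulary(words, is_correct, reinforce_terminal,
                     reinforce_preterminal, max_cycles=100, max_attempts=1000):
    """Cycle through the vocabulary until every word is correct on first test.

    is_correct(word) presents an allomorph of the word and reports whether
    the response was right; the other two callables each apply one
    reinforcement. Returns the number of cycles used, or None on give-up.
    """
    for cycle in range(1, max_cycles + 1):
        all_correct_first_time = True
        for word in words:
            if is_correct(word):
                continue
            all_correct_first_time = False
            # First failure: correct the terminal network only.
            reinforce_terminal(word)
            attempts = 0
            while not is_correct(word) and attempts < max_attempts:
                # Still wrong: reinforce the preterminal network, then
                # apply a further terminal correction and retest.
                reinforce_preterminal(word)
                reinforce_terminal(word)
                attempts += 1
        if all_correct_first_time:
            return cycle
    return None
```

A caller would supply the six words AB, BA, AC, CA, BC, CB together with
functions that present stimuli to, and reinforce, an actual network.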


A very limited amount of experimental work has been done with
this system, using coin-tossing experiments and pencil-and-paper simulation
techniques to investigate performance for the three-phoneme language
considered above. Note that in this experiment, the perceptron is never
given a single phoneme in isolation, but always as part of a two-phoneme
word. Moreover, the perceptron is never corrected for "mistakes" in a
single phoneme; reinforcements applied to the preterminal network are
maintained for the duration of an entire word, regardless of whether one or
both phonemes are causing the difficulty. Nonetheless, it is found that as
long as the number of $A^{(2)}$ units is greater than the number of phonemes
(five $A^{(2)}$ units have been found to work well), the system tends to form a
phoneme-code at the $A^{(2)}$ level; i.e., after a period of training, each
phoneme (A, B, and C) activates a different set of $A^{(2)}$ units, and all
allophones of a given phoneme tend to activate the identical set of $A^{(2)}$ units.

These results can be obtained in a very short training sequence
(generally less than one complete run through the 6-word vocabulary) with
a suitable choice of the parameters ε and δ (which determine the rate of
growth and decay of the instability coefficients, $P_{ij}$). On the other hand,
no deterministic system has been found which will yield comparable results,
although something like a dozen alternatives have been tried. A rough heuristic
explanation for the observed effect can be given as follows: When the system
arrives at some state in which the activities of the $A^{(2)}$ units constitute a
phoneme-code for the language, new words can generally be learned with at most
one or two reinforcements of the terminal network, so that there is little
occasion to reinforce the preterminal connections. Consequently, the instability
coefficients, $P_{ij}$, all decay towards zero, and the probability of disrupting
the learned code, even if a reinforcement of the terminal network does fail to
correct an error, is negligible. On the other hand, if any two phonemes are
assigned the same code, there will be repeated confusions of words which can only
be distinguished by means of the undiscriminated phonemes. Consequently, the
preterminal network will frequently be reinforced for words containing these
phonemes, but not for other words. Therefore, the connections originating
from $A^{(1)}$ units which are activated by one of the conflicting phonemes will
tend to acquire large instability coefficients, leading eventually to the
modification of the $A^{(2)}$ responses to these phonemes. But since the corrections
are applied probabilistically, the system will tend to try out arbitrary $A^{(2)}$
codes, and is thus immune to "trapping" cycles, which tend to occur in
deterministic models. In brief, the effect of the instability coefficients is to
make those connections most susceptible to change which are most troublesome to
the system.


It remains to be seen why the system tends to assign the same
$A^{(2)}$ code to all allophones of a given phoneme, rather than merely making
up totally unique codes for every input pattern. In part, this is helped by
keeping the number of $A^{(2)}$ units small, so that conflicts are likely to arise
if the code is not an economical one. The main effect, however, is due to the
fact that different allophones of a given speech sound are not arbitrary,
independent patterns, but tend to be highly correlated in the frequency-time-
amplitude picture which comes from the sensory system. Thus the conditions
are ideally suited for generalization from one allophone to nearly
identical sounds, from there to next-nearest neighbors, etc. In fact, the
tendency would be to classify all sounds identically (due to positive $g_{ij}$
coefficients in an α-system) were it not for the intervention of the experimenter
or r.c.s., which forces the separation of significantly different sound patterns.
The spontaneous clustering of "similar" sounds can be compared to the
spontaneous clustering of "similar" visual stimuli discussed in Section 7.3,
and demonstrated for a γ-system in Experiment 9 (Page 214).


By adding fixed back-connections from the $A^{(3)}$ units to the $A^{(2)}$ units in
the perceptron of Figure 67, the recognition of individual phonemes may be
more readily influenced by the preceding sequence. Alternatively, variable-
valued back-connections from $A^{(3)}$ to $A^{(2)}$ units might be conditioned, by a
suitable training procedure, to provide a bias to the $A^{(2)}$ units, tending to favor
the recognition of the most probable next phoneme, as determined by the
prior sequence.

While the above discussion has concentrated on demonstrating the
possibility of a self-organizing mechanism for phoneme analysis, it is also
possible to employ a somewhat simpler version of the five-layer system in
which the $A^{(2)}$ units are actually trained by the experimenter to emit a chosen
code for each phoneme. In this case, the $A^{(2)}$ units are actually R-units, and
the probabilistic reinforcement rule for the pre-terminal network is no longer
necessary, an ordinary α-system error correction procedure being perfectly
suitable. One might also consider the possibility of extending the five-layer
system in depth, by adding another A-unit layer and terminal R-layer after the
last layer of the present model. By reinforcing first the terminal connections,
then the outputs of the later intermediate layer, and finally those of the
earlier intermediate layer (in case of failure to correct the
mistake at the terminal level), the system might be expected to develop a
phoneme code in the initial part of the network, a syllable code in the middle,
and a code for complete words or phrases at the level of the final R-units.


23.2.3 Melodic Bias in a Cross-Coupled Audio-Perceptron


The final stimulus analyzing mechanism to be considered is one 
which seems likely to occur spontaneously in cross-coupled perceptrons (of 
the type analyzed in Chapter 19) with audio-inputs. Suppose such a perceptron 
is exposed to a random sequence of notes, covering a range of several octaves, 
and played by a variety of string and wind instruments. Each note is held long 
enough for the cross-connections of the association system to be reinforced, 
before the next note is sounded. Then, assuming that the input comes from a 
Fourier analyzing system, the fundamental will be associated most strongly 
to the major overtones of the sequences characterizing the instruments employed. 
Thus the main association will generally be to the octave above or below, next 
to intervals of a major fourth and fifth, etc. This means that the main
harmonic intervals of a twelve-tone scale will tend to predominate, rather
than purely random frequency associations.
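
The relation between the overtone series and the twelve-tone scale can be
illustrated numerically. The sketch below is not part of the original
analysis; it assumes ideal harmonic overtones at integer multiples of the
fundamental and an equal-tempered scale, and rounds each overtone to the
nearest semitone. The function name is illustrative.

```python
import math


def nearest_semitones(harmonic):
    """Semitones above the fundamental of the given harmonic number,
    rounded to the nearest step of an equal-tempered twelve-tone scale."""
    return round(12 * math.log2(harmonic))


# Interval class (semitones modulo the octave) of harmonics 2 through 6.
interval_classes = [nearest_semitones(h) % 12 for h in range(2, 7)]
print(interval_classes)  # [0, 7, 0, 4, 7]
```

Here 0 is an octave, 7 a fifth, and 4 a major third; the fourth (5 semitones)
arises as the complement of the fifth within the octave.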

Such a system will tend to respond most unambiguously to chords
and combinations of notes bearing a simple harmonic relation to one another
(e.g., major fifths, fourths, and octaves) while strongly discordant combinations
will tend to create a conflict (particularly in a γ-system) such that
the system tends to oscillate between several alternative and mutually
competitive activity states.


By adding variable-valued back-connections from R-units to A-units
(as in Figure 60), and associating a different response to each fundamental
tone, the perceptron can be made to emit responses corresponding to a
melodic sequence, if each response in turn is suppressed shortly after it is
turned on. Such a perceptron, preconditioned as above, will tend to pick a
harmonically consistent sequence, probably avoiding major shifts in tonality
except by means of gradual progressions.


These observations, although suggestive, should not be over-
interpreted. It seems plausible that melodic and harmonic biases in music
have a fundamental basis in the overtone series (as Hindemith has suggested);
however, the ease of vocal transition from one note to the next, and other
considerations which play no part in the above model, are undoubtedly of equal
importance in the determination of musical traditions and the conditioning of
musical perception.

24. PERCEPTION OF FIGURAL UNITY


In almost all tests of perceptron performance considered in
previous chapters, the environment, or stimulus world, was assumed to
consist of discrete objects, or events, occurring one at a time in an ordered
sequence. The actual physical environment which we experience on a day-to-
day basis is not of this form; the visual field, in particular, is likely to contain
a large number of different objects, patterns, or constellations of objects
simultaneously. In human perception, it is easy to detect and name familiar
objects in an unfamiliar scene, such as a landscape or a strange room. For a
perceptron, each such combination of objects represents a new "composite"
stimulus. If the composite organization consists of familiar patterns which
have previously been learned in isolation, then it has been demonstrated that
the perceptron may attend selectively to one object or pattern, and respond
consistently to this object (see Chapter 21). For the human observer, however,
it is not necessary for the individual objects or component patterns in the field
to have been previously learned individually; totally new and unfamiliar
organizations may nonetheless be perceived as "objects". Other organizations, no
matter how familiar, will always be perceived as sets of objects, rather than
as single entities.


The organization of a complex field into "objects" or distinct entities
is frequently ambiguous, in that the field permits many alternative constructions
or organizations of "meaningful parts". Problems of reversible perspective,
the interpretation of Rorschach ink blots, or the detection of alphabetic
characters in collections of random lines, all serve to indicate this ambiguity.
The recognition of an "object" in the human perceptual process is generally
experienced as a figure-ground organization, in which the object emerges as
"figure" while the rest of the field serves as "ground". Hebb, who holds the
segregation of figural patterns to be an innate process, has proposed the term
"primitive unity" for such figural entities (Ref. 33). The perception of such
unity is clearly essential for an organism which must move about and interact
with the objects of its environment. It applies not only to spatial organizations
but to temporal sequences as well; a sequence of human movements is broken up,
perceptually, into acts, steps, or gestures, while speech or music is divided
into words or phrases, even if the sequence of sounds is an unfamiliar one.


The Gestalt psychologists have considered the problem of figural
unity from the standpoint of what constitutes a "good figure" (cf. Ref. 44).
It is assumed that certain organizational properties of the stimulus field lead
to a preference for one figural organization rather than another, and considerable
experimental data have been gathered on the influence of such factors as contrast,
boundedness, connectedness, and the like. There is no doubt that all of these
factors are important determinants of figure-organization in human perception.
For present purposes, however, we will attempt to work with the hypothesis that
what is most readily seen as a figural entity in a given environment tends to be
an organization which is likely to undergo a continuous transformation in that
environment (e.g., a detachable rigid object, or surface bounded by
discontinuities). Whether the patterns which are most likely to be operated on by a
continuous transformation are learned or innately recognized is left open, for
the time being; it seems likely that both innate and acquired biases are at work
in human vision.


Posing the problem in this form suggests that the system must be
sensitive to cues indicating rigid, moveable objects, or surfaces (such as the
faces of a cube) whose two-dimensional projections may undergo transformations
which are discontinuous at their boundaries (i.e., the object moves, but adjoining
regions of the field do not, or undergo a different kind of motion). The attempt
to define figural objects as connected blobs of uniform illumination (as has been
advocated in several computer programs) seems quite inapplicable, except under
highly contrived and artificial conditions. It seems likely that in actuality, a
combination of many different cues of "good figure" is at work simultaneously,
the final organization being arrived at by an active process, typically involving
a good deal of trial and error before a good "fit" is obtained.


The cues which are suggested by psychological experiments as being
influential in the determination of figural organization, or the perception of
separate entities, include the following:

1) Differential motion of textured or bounded regions, or sets of
points in the retinal field.

2) Cues indicating differential distance or "depth" of surfaces, or
sets of points.

3) Differential surface properties in a bounded region (e.g., color,
texture, or type of fine-structure).

4) Contours, boundaries, or discontinuities in surface gradients.

5) Object familiarity.


These five types of information are listed in approximate order of their
strength, or dominance. If two constellations of points in a visual field are
seen in relative motion, then even if they are intermixed spatially, they will
tend to be seen as distinct objects, and the observer will have difficulty
attending to both simultaneously. This is illustrated by the view of a moving scene
outside a dust-streaked train window: either the window or the outside scene
can be viewed as an object, but not both in combination. An experiment by
Gibson employs motion pictures of talcum powder scattered on two glass plates,
one behind the other. As long as both plates are stationary, or both moved
jointly, the two planes cannot be separated; as soon as differential motion is
introduced, however, the picture breaks up unmistakably into two planes,
each with its own distribution of spots.

The relationship of depth to figure organization is well known, and
suggests that an attack on problems of depth perception in perceptrons will also
contribute a great deal to the figural unity problem. The remaining cues
(contrasting surface areas, boundaries, and familiar object recognition) are
the ones most generally incorporated in attempts at devising computer programs
or nerve-net models for figure segregation. It should be noted that the last of
these (object familiarity) is the only one demonstrated as workable in perceptrons
up to this point, in the selective attention mechanisms of Chapter 21; nonetheless,
this mechanism is only useable under relatively ideal conditions, in which objects
are present without overlap, confusing lines, spots, or "camouflage", and where
it can be assumed that a pattern which contains the sensory components of a
familiar object actually represents the object, rather than a random
concatenation of lines or points of illumination.


In order to evaluate the performance of a perceptron in the realm of
figural organization, or the perception of "unity", a suitable set of criterion
experiments must be defined. This proves much more difficult than in the
testing of discrimination capabilities, or the study of generalization over a given
group of transformations. In the simplest case, we may require a decision as to
presence or absence of a figure in a noisy field. In this case, the detection
experiments discussed in Sections 7.4 and 8.4 may be employed, with little
ambiguity. In the case of organized environments, however (cf. illustration
in Section 8.4) it is frequently difficult to decide on an a priori basis that a
particular decision is "right" or "wrong". If the field is sufficiently ambiguous,
or the context is not indicated, a particular pattern of lines might represent the
letter "E" or a random pattern of cracks on concrete. To evaluate performance
on such material, it may be helpful to run the same experiment with human
subjects, to provide an arbitrary standard for comparison. The results, however,
are always subject to interpretation, based on the intentions, experience, and
additional information available to the human observers in contrast to the
perceptron.

Three types of criterion experiments seem possible:

1) Description of the figure by a multi-response perceptron (e.g.,
"small right triangle in upper left, with cross-hatched surface").

2) Detection of familiar objects; perceptron is trained to tell
whether object is present or absent.

3) "Test-point experiments" where the perceptron can attend
selectively to a test-point, or the end of a pointer placed in the field,
and tell whether or not the point is in contact with the figure.
In this way, a description of the figure can be obtained by tracing
its boundaries, or obtaining an inventory of its parts.


Little work has been done, to date, to determine the capabilities of
cross-coupled and back-coupled perceptrons in experiments of these types.
The detection experiment is the one most readily performed with the systems
analyzed to date, and it is hoped that some data can be obtained in the near
future. Series-coupled perceptrons appear to offer little hope of good
performance in these problems.


Cross-coupled perceptrons have been observed to form mutually
exclusive "cell assemblies" in their association systems, under the spontaneous
organization rules considered in Chapter 19. It is possible that with a suitable
choice of preconditioning sequence and network parameters, such cell assemblies
may be related to figural organizations, so that when two or more rival figure-
ground organizations are present, the A-units activated will correspond to one
of these organizations in preference to the others. At present, however, this
conjecture must be regarded as pure speculation, with no real evidence to
support it.


The introduction of back-coupling, however, does permit the perceptron
to take advantage of the first and most powerful cue as to figural organization,
namely, differential motion. A suitable organization is illustrated in Figure 68.
The perceptron is a three-layer system with multiple R-units, of the "on-off"
variety. Each R-unit is trained to respond to a different motion, or
transformation. The variable connections from A to R-units and from R to A-units are
reinforced as in Chapter 21, for selective attention systems with variable back-
coupling. Due to the spectrum of time delays, the A-units respond directly to
the movement pattern as well as the shape of the stimulus. The system may be
further improved by adding inhibitory interconnections between the R-units, so
that only one can go on at a time. If there should be two stimulus patterns
simultaneously present on the retina, moving in opposite directions (or one
moving and the other stationary), the dominant response will tend to support
those A-units responding to the stimulus whose motion corresponds to the R-
unit, and will suppress the A-units responding to the second stimulus. The
threshold servo plays the same role as in the systems of Chapter 21. If the
A-system is cross-coupled, with a γ-system rule, the effect will be supported
by the formation of "cell assemblies" characterizing different directions or
velocities of motion.

Figure 68 A PERCEPTRON FOR FIGURAL SEPARATION OF MOVING PATTERNS.

As the stimulus field becomes increasingly ambiguous in its
organization (as in ink-blot patterns, for example) the field organization which
results in a human observer depends less on a passive response to automatic
mechanisms, and more on an active "construction" of a meaningful figure. In
this process, a number of alternatives may be reviewed in quick succession,
before one of them "settles in", and the field loses its ambiguity. This sort
of active structuring of the field may also be possible for a perceptron with
feedback loops from the R-units, if the perceptron can evaluate the strength,
or decisiveness of its response, and actively perturb its response state (and
hence the feedback signals to the A-units) until a strong, persistent response is
obtained. This may be done by adding random Gaussian noise signals to the
inputs of the R-units, resulting in frequent changes in the response state as long
as the signals from the A-system are weak and indecisive.
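
The effect of such noise can be illustrated with a toy calculation. The
following Python sketch is an assumption-laden illustration, not a model of
the perceptron itself: a single on-off R-unit is taken to be "on" whenever a
constant input signal plus zero-mean Gaussian noise exceeds zero, and the
function name, signal levels, and noise amplitude are all hypothetical.

```python
import random


def count_response_changes(signal, noise_sigma, steps, rng):
    """Count how often an on-off R-unit changes state when Gaussian
    noise is added to a constant input signal from the A-system."""
    changes = 0
    previous = None
    for _ in range(steps):
        state = (signal + rng.gauss(0.0, noise_sigma)) > 0.0
        if previous is not None and state != previous:
            changes += 1
        previous = state
    return changes


rng = random.Random(1)
weak = count_response_changes(signal=0.1, noise_sigma=1.0, steps=1000, rng=rng)
strong = count_response_changes(signal=5.0, noise_sigma=1.0, steps=1000, rng=rng)
print(weak, strong)
```

With a weak signal the response state changes on a large fraction of steps,
while a signal several times larger than the noise amplitude holds the
response essentially fixed, which is the behavior the paragraph relies on.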


While the above discussion indicates several possibilities which are
open to experimental treatment, it is clear that much fundamental groundwork
remains to be completed before the problem of figural unity can be attacked in
a systematic manner. At the present time, this problem remains one of the
most severe challenges to all theories of brain mechanisms.

25. VARIABLE-STRUCTURE PERCEPTRONS


All of the memory mechanisms employed in previous chapters
rely on a fixed network structure, in which the weights of connections are
variable. It is occasionally proposed that a system in which the structure of
the network itself is modifiable, with new connections being formed and old ones
discarded on the basis of demonstrated utility, might lead to the evolution of a
better model, with a smaller number of logical elements than would be possible
for a fixed-structure perceptron with random connections. This might, for
example, be a way of evolving special-purpose stimulus analyzing mechanisms
of a high degree of utility for a particular environment. A model in which
structural modification is possible -- i.e., in which the origins or termini of
connections are changed as a result of activity -- has previously been referred
to as an "evolutionary model". Apart from the possibility that such a system
might provide a useful memory mechanism, or adaptive mechanism, it has been
suggested that by observing the terminal states to which such a model goes,
after long exposure to an environment, we might learn something about the
kinds of physical constraints which could be usefully built into future systems
at the outset.

25.1 Structural Modification of S-A Networks 


To date, very little work has been done with evolutionary systems.
Several examples have been programmed for the IBM 704, which indicate a
slight improvement in some cases, but these programs have proven too costly
in computer time to permit extensive experimentation. The cases illustrated
here come from this group of pilot experiments.* A three-layer perceptron
with a single R-unit was employed, and an α-system error correction method
was used for reinforcing the terminal network.


* The programs were written by Kesler, and carried out at the AEC/NYU
Computing Center.

The rules for changing the structure of the network are closely
analogous to those employed for perceptrons with variable S-A connections,
in Chapter 13. Each A-unit, $a_i$, is continuously evaluated by means of a utility
measure, $\xi_i$. If the current response $r^{*}$ is wrong, $\xi_i$ may be increased by 1
with probability $P_1$, $P_2$, or $P_3$, defined as follows:

$P_1$ = probability of incrementing $\xi_i$ if the sign of $v_{ir}$ disagrees
with the desired classification of the current stimulus, and
$a_i$ is active.

$P_2$ = probability of incrementing $\xi_i$ if the sign of $v_{ir}$ agrees with
the desired classification of the current stimulus, and $a_i$
is inactive.

$P_3$ = probability of incrementing $\xi_i$ if the sign of $v_{ir}$ disagrees
with the desired classification, and $a_i$ is inactive.

The quantities $\xi_i$ are assumed to decay by an amount $\delta \xi_i$ at each
stimulus presentation time. If $\xi_i$ reaches or exceeds a threshold level, $\theta_\xi$,
the origins of all connections to unit $a_i$ are reassigned, and $\xi_i$ is reset to zero.
In most experiments, $P_1 > P_2 > P_3$, so that an A-unit is most likely to have its
connections changed if the value of its output signal frequently disagrees in
sign with the intended classification of the stimulus which activated the unit.
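
The bookkeeping for a single A-unit can be sketched as follows. This is a
minimal illustration in Python under stated assumptions, not the program used
in the pilot experiments: the decay is applied before any increment, a unit
that is active with an agreeing sign is never incremented (no probability is
listed for that case), and all function and parameter names are hypothetical.

```python
import random


def update_utility(xi, active, weight_agrees, response_wrong,
                   p1, p2, p3, decay, threshold, rng):
    """Apply one stimulus presentation to the utility measure of one A-unit.

    Returns (new_xi, reassign), where reassign is True when the origins
    of the unit's connections should be redrawn.
    """
    # Decay is applied before any increment (ordering assumed here).
    xi -= decay * xi
    if response_wrong:
        if not weight_agrees and active:
            p = p1
        elif weight_agrees and not active:
            p = p2
        elif not weight_agrees and not active:
            p = p3
        else:
            p = 0.0  # agreeing, active unit: no case is listed for it
        if rng.random() < p:
            xi += 1.0
    if xi >= threshold:
        return 0.0, True
    return xi, False
```

A simulation would call this once per A-unit per stimulus, and redraw the
origins of that unit's input connections whenever the second returned value
is true.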


The results of several experiments (on horizontal/vertical bar
discrimination) are shown in Figures 69 and 70, with the performance curves
for the corresponding fixed-structure models shown for comparison. While
there seems to be a slight advantage for the variable-structure systems
(particularly in Figure 69, where only 20 A-units were used), the improvement
over the fixed-structure system is not impressive. Nonetheless, it is possible
that a more sophisticated procedure for determining which A-units are to be
changed would produce better results. It also seems likely that the
horizontal/vertical bar problem, which is not very demanding in the geometry of origin

Figure 69 EVOLUTIONARY MODEL, IN HORIZONTAL/VERTICAL BAR DISCRIMINATION. MEANS
OF 10 PERCEPTRONS. 50 A-UNITS, X= 86, y= 2, O= 3,
A= .9, A-.39, A, = .0l.

Figure 70 EVOLUTIONARY MODEL, IN HORIZONTAL/VERTICAL BAR DISCRIMINATION. MEANS
OF 10 PERCEPTRONS, ZERO RESPONSES COUNTED WRONG. 20 A-UNITS, X% = 8,
¥ = 2, 6= 3, G,= 3, P= .9, A= .8, Pez Ol.

configurations required for discrimination, may be a poor choice of a calibration
experiment for evaluating the evolutionary model. Unfortunately, the procedure
is so time-consuming for a digital computer that only a small number of
experiments have proved feasible.


As a memory process, the above system seems excessively compli- 
cated. Not only are three distinct probabilities required, under three sets of 
logical conditions, but the £; must be stored as an auxiliary variable for each 
A-unit. This is clearly implausible for a biological mechanism. The difficulties 
encountered seem to be common with those met in all attempts at providing a 
useful memory process which operates on the preterminal connections of the 
network (as in the variable S-A systems of Chapter 13). It is hard to see what 
simple criterion might be employed to identify those connections which should be 
changed in order to improve the final output of the R-units. It seems likely that 
a local information rule (Page 289) is incompatible with an efficient system of
reinforcement at the preterminal levels of the network.*


25.2 Systems with Make-Break Mechanisms for Synaptic Junctions 


A somewhat different kind of structural modification from the model 
described above is that in which there is a fixed set of "potential connections"
to each unit, but these connections may be either "made" or "broken" on an
all-or-nothing basis, in the manner of switches or mechanical relays. A 
possible application of such a mechanism to the terminal network of a three- 
layer perceptron is illustrated in Figure 71. The A-units are divided into a 
set of excitatory units (E-units) whose output is always positive, and a set of 
inhibitory units (I-units) whose output is always negative. All signals are of 
unit amplitude, and the connections from I-units to the R-unit are fixed, only 
the E-unit connections being modifiable. The connections from E-units to the
R-unit are of the make-break variety, the reinforcement rule being as follows:


* There is some hope, however, that the "elastic perturbation" system suggested
in Section 26.4 will prove applicable to this problem.


The reinforcement control system can call for a Δv > 0 or for a
Δv < 0. If a positive increment is required, excitatory connections with active
origins are made with probability P (applied independently for each unconnected
E-unit), while if a negative Δv is required, excitatory connections with active
origins are broken with probability P. If the system begins with initial conditions
such that the number of connected E-units just balances the number of connected
I-units, and if the number of units is very large, the effect of a single
reinforcement will be identical to the application of a quantized α-system
reinforcement to a system with fixed A-R connections. Thus, under the error
correction procedure, this system can be expected to duplicate the performance
of an α-system perceptron quite closely, provided the number of A-units is large.


Figure 71 SIMPLE PERCEPTRON WITH MAKE-BREAK CONNECTION SYSTEM. (Diagram labels:
MAKE-BREAK EXCITATORY CONNECTIONS; FIXED INHIBITORY CONNECTIONS.)
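

The make-break rule lends itself to a short simulation. The following Python
sketch is illustrative only: the function names, the population size, the
activity rate of 0.3, the count of active I-units, and the value P = 0.2 are
all assumptions introduced here, not quantities taken from the text. Each
E-unit connection is treated as a switch, and only connections with active
origins are eligible for change.

```python
import random

def make_break_step(connected, active, sign, p, rng):
    """Apply one make-break reinforcement to the E-unit connections.

    connected: list of bools, True where the E-unit is connected to the R-unit
    active:    list of bools, True where the E-unit is active for this stimulus
    sign:      +1 if a positive increment is called for, -1 if negative
    p:         probability of making (or breaking) each eligible connection
    """
    updated = list(connected)
    for i, is_active in enumerate(active):
        if not is_active:
            continue  # only connections with active origins are eligible
        if sign > 0 and not updated[i] and rng.random() < p:
            updated[i] = True    # make a previously unconnected E-connection
        elif sign < 0 and updated[i] and rng.random() < p:
            updated[i] = False   # break an existing E-connection
    return updated

def r_input(connected_e, active_e, n_active_i):
    """Net R-unit input: unit signals from connected, active E-units
    minus unit signals from the (fixed) active I-units."""
    excit = sum(1 for c, a in zip(connected_e, active_e) if c and a)
    return excit - n_active_i

rng = random.Random(0)
n = 1000                                        # assumed number of E-units
connected = [i % 2 == 0 for i in range(n)]      # half connected at the outset
active = [rng.random() < 0.3 for _ in range(n)]
n_active_i = 150                                # assumed count of active I-units

before = r_input(connected, active, n_active_i)
after_pos = r_input(make_break_step(connected, active, +1, 0.2, rng), active, n_active_i)
after_neg = r_input(make_break_step(connected, active, -1, 0.2, rng), active, n_active_i)
```

With half of the E-units connected at the outset, a positive reinforcement can
only raise the R-unit input and a negative one can only lower it, so that a
single reinforcement acts, on the average, like a fixed-size increment applied
to the active connections.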


An alternative system is one with equal numbers of E- and I-units,
in which the I-connections are also variable. In this case, new connections can
be made, but once established are assumed to be permanent. For Δv > 0, new
E-connections are formed with probability P, as above. For Δv < 0, however,
new I-connections are formed with probability P, instead of breaking
E-connections. At the outset, assuming that all A-units are initially
disconnected, this system again behaves in much the same way as an α-system
perceptron. As the system "saturates", due to the exhaustion of available
connections, the increments to the R-unit input signal from each new
reinforcement become progressively smaller. If the number of A-units is
infinite, then the system never saturates entirely, new reinforcements always
having some effect, although this is apt to become negligible as saturation is
approached.
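

The approach to saturation can be made concrete with a small expected-value
calculation. The sketch below is a simplification under assumptions that are
not in the text: every reinforcement is taken to draw on the same pool of
initially unconnected units, each unit is taken to be active independently
with probability q_active at every reinforcement, and the numerical values
are arbitrary. The function name is likewise hypothetical.

```python
def expected_new_connections(n_units, q_active, p_make, n_steps):
    """Expected number of connections formed at each successive reinforcement.

    Assumes (for illustration only) independent activity with probability
    q_active per unit, and a single pool of n_units that starts unconnected.
    """
    remaining = float(n_units)   # expected number of still-unconnected units
    increments = []
    for _ in range(n_steps):
        made = remaining * q_active * p_make
        increments.append(made)
        remaining -= made        # connections are permanent: the pool only shrinks
    return increments

steps = expected_new_connections(n_units=1000, q_active=0.3, p_make=0.2, n_steps=50)
```

Under these assumptions each increment is a fixed fraction of the one before
it, so the effect of a reinforcement falls off geometrically but never reaches
zero in a finite number of steps.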


These models are of more interest as possible analogs for biological 
systems than as significantly new types of perceptrons. Their properties, 
short of the saturation condition, closely resemble the systems previously 
considered, but they do not require values which change sign, and are suggestive
of a possible synaptic growth mechanism in biological memory. As engineering
devices, their reliance on probabilistic mechanisms is apt to make their
construction more difficult than the α-system models.


26. BIOLOGICAL APPLICATIONS OF PERCEPTRON THEORY


When the perceptron was first proposed, it was considered 
primarily as a model of biological memory mechanisms. As the models 
became more sophisticated, a number of psychological properties not directly 
related to memory were investigated, but the main emphasis, as a biological 
model, is still on the adaptive mechanisms employed, and the recording of 
past experience. In this chapter, the application of perceptron theory to 


biological problems will be considered primarily from this point of view. 


26.1 Biological Methods for the Achievement of Complex Structures 


The biological evidence which has been cited repeatedly throughout 
this volume indicates that highly organized structural constraints exist in 
many parts of the nervous system. Apart from the gross anatomical complexi- 
ty of the brain, the mechanisms of optic nerve growth and regeneration, the 
stimulus analyzing mechanisms found by Lettvin in the frog and by Hubel in the 
cat, and the better known mechanisms of motor coordination and control 
indicate that organization of a rather involved type may occur even in the fine 
structure of the network. In perceptron theory, as it has developed to date, 
most emphasis has been placed on learning and memory as a means of 
achieving such organization. In actuality, a number of alternative procedures 
are possible for the creation of complex networks, satisfying a given set of 


logical constraints. These include: 


1. Logical specification (e.g., let the i-th cell of the k-th row be
connected to the (i+1)-st cell of the (k+3)-rd row, for all i > k).
This is equivalent to an exact blueprint of the network.


2. Natural selection, whereby the useful sub-networks of an
originally random population survive, while the others decay.


3. Simple spatial constraints (gradients, directional bias, or
distributions of connections specified by a small number of
parameters).


4. Typological constraints (e.g., cells of Type A can only connect
to cells of Type B or C, where cell types might be distinguished
by chemical properties).


Of these four mechanisms, only the last three seem to be well 
suited for the development of biological nerve nets. The first mechanism, 
logical specification of the structure, is primarily a contrivance of engineer- 
ing, which is well suited to the construction of computers, but which seems to 
have no clear counterpart in known mechanisms of growth and maturation. 

It is this first method of control, however, which has been most investigated
in studies of brain mechanisms during the last few decades (e.g., References
17, 57, 71).


In specifying the initial physical form of the networks in perceptron 
theory, most attention has been given to the third alternative; spatial constraints 
of a simple sort have been employed throughout. In the last chapter, limited 
use was made of the second and fourth methods. Typological
constraints have thus far been used mainly to distinguish excitatory from
inhibitory neurons (Section 25.2), but it seems likely that their use is relatively
widespread in biological systems. In particular, Sperry's work on neural
maturation and fiber regeneration, and that of Lettvin and Maturana on the
regeneration of scrambled connections in the frog's brain, suggest a chemical
control or "homing mechanism" of remarkable sensitivity.


The limited experiments performed thus far on "natural selection"
as a structural control mechanism do not appear particularly promising
(Section 25.1). The evolution of the network occurs too slowly, and is too
subject to disruption and instability of partially achieved organizations, to be
useful in any of the forms examined up to this point. It remains possible,
however, that a more rapidly converging mechanism may be found, and the
field remains open for future investigation. Typological constraints, on the
other hand, are likely to come into their own with the investigation of
perceptrons having complex mixtures of property detectors, and other
specialized A-units, all deriving their connections from a common sensory
field.


26.2 Basic Types of Memory Processes 


Perceptron memory mechanisms have all taken the form of 
modifications of the signals transmitted across synaptic junctions. There 
appear to be at least two basic types of memory dynamics which are useful 
in perceptrons. The first is a system in which values remain stable unless 
action is taken by a reinforcement control system, based upon an evaluation 
of the current response of the perceptron. The most effective method actually
investigated for this purpose has been the α-system, with an error correction
procedure for modifying the values of A to R-unit connections. The
second type of memory is one which achieves stability only in the form of a 
dynamic equilibrium with a continuously active reinforcement process. This 
second system does not depend upon evaluation of the perceptron's output, but 
maintains a continuous state of adaptation in the network, based only on local 
activity. In practice, it seems likely that a decaying γ-system will prove
to be the best of the systems of this type which have been analyzed. The 
first type of mechanism permits the system to learn from an external 
"teacher", or by reward and punishment experienced as a result of trial and 
error activity. The second type permits the perceptron to acquire an internal 
model of the "similarity structure" of its environment, as defined by the temporal
relationships of moving stimuli. It may be that more complex forms of
organization (such as the recognition of connected patterns, or Gestalten) can also
be achieved by means of dynamic processes of the second type, but this remains
conjectural at this time.


While it is certainly conceivable that additional basic mechanisms
may be required to perform the memory tasks of a complex organism, there
seems to be some reason to believe that the two types of dynamics characterized
above may prove sufficient for the phenomena of "adaptive behavior".
The first variety permits the system to be "set" passively to any desired state,
which will then be retained indefinitely. Thus any form of permanent learning
can be handled, in principle, by such a system. The error correction theorems
of Chapters 5 and 10 seem sufficient to demonstrate this assertion. On the
other hand, any spontaneous modification process which is not to be
self-defeating must ultimately achieve some sort of dynamic equilibrium with the
conditions which induce the change in state; without such a mechanism (provided
in the case of our four-layer and cross-coupled perceptrons by the decay term
in the equations) the dynamic range of the memory variables must ultimately
be exhausted, and the system will saturate. In any case, a mechanism which
is to serve as a basis for generating a model of the external environment must
be one which ultimately approaches a stable condition, as the model approaches
a true representation of the external world. Such considerations make the
second mechanism appear to be a natural complement to the first.
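

The role of the decay term can be illustrated numerically. The sketch below
assumes, purely for illustration, a single memory value v which gains eta
times the local activity at each step and loses delta times its current value;
this simplified rule, the function name, and all parameter values are
assumptions made here and are not the equations of the earlier chapters.

```python
def run_memory(eta, delta, activity, v0=0.0):
    """Iterate v <- v + eta * a - delta * v over a sequence of activity levels a."""
    v = v0
    history = []
    for a in activity:
        v = v + eta * a - delta * v
        history.append(v)
    return history

constant_activity = [1.0] * 200

# With a proportional decay term the value settles at eta * a / delta.
with_decay = run_memory(eta=0.1, delta=0.05, activity=constant_activity)
equilibrium = 0.1 * 1.0 / 0.05

# Without decay the value grows without limit, and any bounded
# physical variable would eventually saturate.
no_decay = run_memory(eta=0.1, delta=0.0, activity=constant_activity)
```

In this toy case the decaying value comes to rest at a level determined by the
activity which drives it, whereas the undecayed value simply accumulates; the
former is the kind of dynamic equilibrium which the argument above requires.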


Two memory functions which might call for processes of a different 
logical character are the serial recording of experience (in the manner of a 
tape recorder or motion picture camera) and a temporary memory for data which 
are to be used in the immediate future and then forgotten (as in the "memory"
of a digital computer). For the second of these phenomena, it is likely that a 
dynamic storage mechanism, such as pools of activity or reverberating loops 
which can be triggered and extinguished by a suitable control system, will prove 
to be the most effective storage mechanism. The problem of serial memory is
a more serious one, but can only be dealt with together with the problem of
selective recall and the mechanisms for its control.


It is certain that in a simple perceptron, memories are not tagged
in any way which would permit their serial order to be re-established later.
But the "memories" stored in a simple perceptron are in any case merely
associative, rather than substantive. The nature of substantive memory in 
humans must be investigated more carefully in the future. While it seems 
unlikely that a complete image or state of the association system is stored, it 
is nonetheless clear that a great deal more information is retained than is 
represented by a simple classification of an experience as belonging to one of 
n categories. One alternative is that of storing a description of a large number
of characteristics or dimensions, which jointly permit the reconstruction of 
the original experience by the active creation of a model, or image, which 
approximates the original state of the association system. Among the charac- 
teristics stored would be such time-tagging information as the location in which 
the event occurred, the time of day, the activity that the subject was engaged in, 
etc. An accumulation of such cues would enable a suitable search process to 
locate the experience in time, and to associate it with preceding or successive 
events in appropriate order (cf. Reference 79, Chapter VIII). In any case,
it seems likely that substantive recall is an active, creative (or recreative)
process, rather than merely a passive reading-out of a memory bank.


26.3 Physical Requirements for Biological Memory Mechanisms


From the considerations just stated, it should be clear that not one 
but several memory mechanisms are likely to be encountered in a complex 
system. Limiting our attention, for present purposes, to the two basic mechanisms 
which have been studied in perceptrons, what can we say as to the probable 


physical characteristics of the memory traces? 


First, as to location: it appears that the most suitable location is in the 
connections, or synapses, which mediate the interaction of particular pairs of 


neurons. Perceptrons in which the memory trace affects an entire neuron and 
all of its interactions with other neurons have been investigated (Reference 79),
but this has invariably involved the introduction of artificial constraints on the 
topology or logic of the network, in order to limit the effects of reinforcement 
to the desired transmission channels. In any case, systems in which the re- 
inforcement is specific to the connections appear to be far more economical 


than those in which reinforcement is applied to an entire neuron, or A-unit. 


A second condition is that the memory change should be reversible. 
Both the externally controlled error-correction procedure and the fully automatic 
memory processes of the cross-coupled perceptrons require reversible modifica- 
tions. In the case of the error-correction procedure, two antagonistic control 
mechanisms seem to be called for, one of which strengthens the excitatory outputs 
of active A-units, and the other of which weakens excitatory outputs or strengthens 
inhibitory outputs. While most of our analyses have assumed that the actual sign 
of the value of a connection may change from positive to negative, this is clearly
a non-biological artifact, introduced for convenience in analysis. The same effects 
could be achieved by a system in which half of the connections are always positive, 
and half are always negative. If the negative connections are fixed in magnitude, 
then only the excitatory connections need be modified, yielding a net positive 
signal if they exceed the strength of the fixed inhibitory component, and a net 
negative signal if they fall below the inhibitory strength. Alternatively, the 
excitatory connections might be fixed, and the inhibitory connections variable, 


or each type might be variable within its own dynamic range. 
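

The equivalence between a sign-changing value and a pair of single-signed
connections can be checked with a few lines of arithmetic. In the following
sketch the function names, the fixed inhibitory strength of 1.0, and the
excitatory range of 0 to 2.0 are hypothetical choices made for illustration.

```python
def net_signal(excitatory, inhibitory):
    """Net contribution of a connection pair: a non-negative excitatory
    strength minus a non-negative inhibitory strength."""
    return excitatory - inhibitory

def adjust_excitatory(excitatory, increment, e_max):
    """Apply a signed increment to the excitatory strength only,
    keeping it within its dynamic range [0, e_max]."""
    return max(0.0, min(e_max, excitatory + increment))

INHIBITORY = 1.0   # fixed inhibitory strength (assumed value)
E_MAX = 2.0        # upper limit of the excitatory range (assumed value)

e = 1.0                                        # net signal starts at zero
e_up = adjust_excitatory(e, +0.5, E_MAX)       # excitation now exceeds inhibition
e_down = adjust_excitatory(e, -0.75, E_MAX)    # excitation now falls below inhibition
```

The net signal can thus take either sign although neither component ever
changes sign; the attainable range is limited to the interval from -INHIBITORY
to E_MAX - INHIBITORY.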


The requirement that the "strength" or value of a connection be
modified as a consequence of the correlated activity of both terminal units, 
rather than just the transmitting unit, appears to place a unique condition on the 
memory process. Most metabolic processes such as growth, changes in cell 
chemistry, etc., which might be involved here are of a type which generally depend 


only upon the cell in which the change occurs, and its over-all environment, 
whereas we seem to require a two-factor phenomenon, which depends upon the
activity of two specific cells. This writer has previously stated the conjecture 
(Reference 83) that the required effect might be obtained if the production of 
transmitter substances depended upon an enzyme or catalyst produced in the 
nucleoplasm of the trans-synaptic cell, and released to the medium when that 
cell is stimulated to activity. The presynaptic fibers which were most recently 
active, being in a heightened metabolic state, would then be in the most favorable 
position to compete for the limited supply of this catalyst, which would then 
enable them to produce their transmitter substance at an increased rate in the
future. The competition for metabolites in limited supply in the neighborhood
of a particular cell body would tend to create a γ-system, in which the most
active cells would gain at the expense of the inactive ones. Whether this is a
correct description of the mechanism or not, some type of symbiotic relationship
seems to be demanded between the presynaptic fibers and the post-synaptic cell,
in order to provide a memory mechanism of the type analyzed in Part III of this
volume.


The memory mechanism employed for error-correction learning 
places rather different demands on the biological system. Here the reinforcement 
depends not so much on the correlation of activity of the two terminal units, as on 
the correlation of the activity of the transmitting unit with the decisions of the 
reinforcement control system. It is conceivable that this might again involve
the release of a catalyst in the neighborhood of the active connections, but in this 
case the release must be remotely controlled -- perhaps through glandular action. 
In one respect this is a simpler requirement, conceptually, than the former case, 
where the activity of two specific cells had to be considered for each connection 
which might be reinforced. In the present case, the general release of an 
excitatory or inhibitory reinforcing agent from a central source would appear to 
be sufficient; the recently active connections, being most metabolically active,
would tend to be most strongly affected. In a second respect, however, this
mechanism presents a new problem which is more serious: the problem of
limiting the effect of reinforcement to the specific response which is to be
corrected.


It was demonstrated in Chapter 12 that the error correction procedure 
can be guaranteed to work only if the correction is limited to the erroneous 
responses, in a multiple response system. To achieve this condition in a biologi- 
cal system, it seems that a mechanism is called for which can select one response, 
or response component, at a time as a candidate for reinforcement, and limit the 
corrective action to the selected locality. In dealing with motor responses, the 
topographical mapping of the motor control areas of the cortex is likely to prove 
helpful here, particularly if we adhere to the hypothesis that the memory trace 
involves the release of a chemical agent which affects everything in its
neighborhood.*


The proportional decay mechanism which is required for the "spontaneous"
memory process is probably the easiest of the requirements to rationalize
in a biological model; a chemical mechanism, in particular, would tend to exhibit
decay at a rate which increases with the concentration.


At present, any treatment of the compatibility of perceptron theory 
with biological memory mechanisms must remain entirely speculative. It is 
to be hoped that as additional evidence on synaptic transmission and neurochemistry 
comes to light, it can be fitted into the picture. Thus far, there seem to be no 
serious conflicts, although there are a number of missing links. The considera- 
tions stated above do suggest several plausible hypotheses for experimental 


investigation.


* A procedure is now being investigated by which an error correction is applied
to a randomly chosen set of R-units, the value increments being transient
rather than permanent, unless the correction actually proves effective. It is
hoped that this technique will yield an efficient reinforcement mechanism which
does not depend on specification of the erroneous R-units. (See Section 26.4.)


26.4 Mechanisms of Motivation


The problem of motivation for perceptrons, considered as models 
for biological nervous systems, has hardly been treated adequately up to this 
time. The reinforcement control system, which forms part of the experimental 
system, plays the role of a sort of deus ex machina, which not only has
knowledge of right and wrong responses, but can control the distribution of
reinforcement to individual R-units in the perceptron, as required. A more
"natural" system with only a slight reduction of efficiency does seem to be
possible, however, although at present the model proposed is a heuristic one,
on which no quantitative analysis has been completed.


The proposed model for biological reinforcement mechanisms is
illustrated in Figure 72. In this system, the reinforcement control system
(r.c.s.) is no longer external to the system, but is essentially part of the
perceptron. It is assumed that the perceptron system includes a sensing device
for a physiological condition which has been arbitrarily called the
"discomfort level", measured by the variable D. This might be compared to
Ashby's concept of "essential variables". In addition to continuously measuring
the variable D, which is assumed for simplicity to be some function of the
current stimulus pattern, a second mechanism (readily represented by a neuron
with inhibitory input connections with a short time delay and excitatory
connections with a longer time delay, both originating from the "D-detector")
responds to a negative dD/dt. The corrections to this system are random
perturbations applied either to active connections, or to all connections of
the perceptron; the increments, however, take the form of "elastic
perturbations", so that the connections tend to decay back to their previous
values unless a "positive reinforcement" occurs to "fix" the new values. Thus
negative reinforcement applies a slight random perturbation, which tends to
disappear unless it actually proves helpful, in which case it is stabilized by
a positive reinforcement.


Figure 72 EXPERIMENTAL SYSTEM EMPLOYING ELASTIC PERTURBATIONS, STABILIZED
BY IMPROVEMENTS IN SENSORY SITUATION (COMPARE Figure 4).


For this system to function efficiently, it is again necessary to 
assume some degree of temporal continuity in the environment, so that the
change in D indicates a true improvement in the response of the system, rather
than an irrelevant change due to a sudden alteration of the environment.
Preliminary simulation experiments to evaluate this scheme are now in progress, 
employing the Burroughs 220 computer, and indicate that the system should work 
with a reasonable degree of efficiency, as compared to a system employing a 
more deterministic error correction procedure. The results of these experiments
will be reported as soon as the data are complete. The system has the
advantage that it works well with an arbitrarily large number of R-units, 
without requiring an individual decision as to the error of each one, as long as 
D is some monotone increasing function of the joint error, such as the norm of
the difference vector, ||p* - r*||. Such a representation will work best when all
of the R-units are continuous transducer units, so that any random
value-perturbation will have a 0.5 probability of yielding an improvement.
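

A minimal sketch of the elastic perturbation scheme is given below in Python.
It rests on several assumptions which are not part of the proposal as stated:
the R-units are taken to be linear transducers of a fixed pattern of A-unit
activity, the perturbations are Gaussian, D is taken to be the Euclidean norm
of the difference between the desired and actual responses, and the gradual
decay of an unhelpful perturbation is replaced by an immediate return to the
previous values. All names and numerical values are hypothetical.

```python
import math
import random

def discomfort(weights, activity, target):
    """D = Euclidean norm of (target - response) for linear R-units."""
    err = 0.0
    for w_row, t in zip(weights, target):
        r = sum(w * a for w, a in zip(w_row, activity))
        err += (t - r) ** 2
    return math.sqrt(err)

def elastic_step(weights, activity, target, scale, rng):
    """Try one random perturbation of all connection values.

    If D falls, the new values are kept (the perturbation is "fixed");
    otherwise the previous values are restored.
    """
    d_before = discomfort(weights, activity, target)
    trial = [[w + rng.gauss(0.0, scale) for w in row] for row in weights]
    d_after = discomfort(trial, activity, target)
    if d_after < d_before:
        return trial, d_after
    return weights, d_before

rng = random.Random(1)
activity = [1.0, 0.0, 1.0, 1.0]     # assumed A-unit activity for one stimulus
target = [0.5, -0.25, 1.0]          # assumed desired outputs of three R-units
weights = [[0.0] * len(activity) for _ in target]

d_start = discomfort(weights, activity, target)
d = d_start
for _ in range(2000):
    weights, d = elastic_step(weights, activity, target, scale=0.05, rng=rng)
```

No decision is made about the error of any individual R-unit; the only signal
used is whether D has fallen. By construction D never increases under this
procedure, and the number of R-units can be enlarged without changing the rule.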


27. CONCLUSIONS AND FUTURE DIRECTIONS


Man's intelligence is a unique phenomenon on our planet, occurring 
at such a level of complexity in a single species only. The lack of other 
similarly intelligent species is unfortunate from the standpoint of science, 
for it makes it difficult to tell from comparative evidence which features of 
human psychology are accidental products of man's peculiar biological constitu- 
tion, and which are fundamental to the nature of intelligence itself. Despite 
this lack of comparative material, some of us believe that it may ultimately be 
possible to answer such questions through an understanding of the physical basis 
of psychological phenomena, independently of the biology of any one species. The 
perceptron program represents a small part of such an undertaking; it is an 
attempt to study the psychological properties of certain highly simplified mathe- 
matical or physical models of the central nervous system, in the hope that such 
a study may throw light on basic principles which can then be applied to more 


sophisticated models. 


The use of "models" to represent complicated natural phenomena
has been an essential technique in the physical sciences for many centuries.
The model is a simplified theoretical system, which purports to represent the
laws and relationships which hold in the real physical universe. The solar
systems of Ptolemy, Copernicus, and Einstein, and the atomic models of
Democritus, Bohr, and Heisenberg represent two successions of such models,
each in turn coming somewhat closer to an adequate representation of its subject
matter. In some cases (the concept of an "ideal gas", for example) the model
deliberately neglects certain complicating features of the natural phenomena 
under consideration, in order to obtain a more readily analyzed system, which 
will suggest basic principles that might be missed among the complexities of 
a more accurate representation. Such simplified models may then be refined 
through a series of "perturbations", which introduce the known complications
one at a time, in a manner which permits the mathematician to incorporate them




into his analysis. It is this approach which has been most characteristic of the
perceptron program.


Stated in simplest terms, our objective has been to discover a 
physical system, or abstract model, which will be capable of "perceiving" its 
environment, and learning to recognize those objects or events which it has 
perceived in the past. However, since it is our purpose to understand the actual 
mechanisms employed by the brain, rather than simply to construct a new type 
of computing device, the perceptron models are constrained in their organization 
and dynamic properties by what is known of the biological nervous system. Rather 
than attempting to "invent" or ''construct'' a machine which will calculate such 
things as similarities or geometrical properties of stimuli, the approach has 
been to begin with a hypothetical network of idealized neurons, or nerve cells, 
resembling the brain in its general organization, and then analyze the system 
mathematically to determine whether or not it possesses "psychological" 
properties of interest. Where the model is found to deviate markedly from the 
behavior of biological systems, modifications are suggested, and the new model 
that results is subjected to the same sort of analysis. In this fashion, it is hoped 
that the necessary conditions for a system to "perceive" in the same manner as
the brain can be abstracted.


In this chapter, we will attempt to summarize the principal results
which have thus far emerged from this approach, the problems which have now
come to the foreground, and the means by which these problems might be attacked.
The possible applications of perceptron theory to engineering devices and the
construction of physical brain models will also be considered. Finally, an attempt 
will be made to anticipate the future relationship of the neurodynamic approach to 
the various alternative strategies by which the problems of understanding and 


simulating intelligence are being investigated. 


27.1 Psychological Properties in Neurodynamic Systems


Our main conclusions deal with the properties of closed experimental 
systems, such as those illustrated in Figures 3, 4, and 72. It has been shown 
that as the topological organization of the perceptron increases in complexity, 
new psychological properties emerge. The principal results can be summarized
as follows:


(1) A network consisting of less than three layers of signal transmission 
units, or a network consisting exclusively of linear elements connect- 
ed in series, is incapable of learning to discriminate classes of 
patterns in an isotropic environment (where any pattern can occur 


in all possible retinal locations, without boundary effects). 


(2) A three-layer series-coupled perceptron is a minimal system capable 
of learning to discriminate arbitrary classes of stimulus patterns 
or stimulus sequences. Any discrimination problem can, in princi- 
ple, be solved by such a system, and any arbitrary response function 


can be assigned to the stimuli of a given universe. 


(3) By means of an α-system error-correction procedure, a three-layer
series-coupled perceptron with simple A-units and a fixed
preterminal network can always be taught the solution to any
classification problem or response function for which a solution exists.


(4) Equations for the learning curves of simple perceptrons under 
various reinforcement rules have been presented. The results 
indicate that for simple tasks, such as the recognition of large 
alphabetic characters against a plain background, the three-layer 
series-coupled system performs with reasonable efficiency, 
although it may require a lengthy training procedure with large 
samples of each stimulus class to guarantee recognition of all
variations, or "allomorphs", of a pattern.


(5) In perceptrons with variable-valued preterminal networks, a
non-deterministic reinforcement rule may be required to guarantee that
the solution to a classification problem will be achieved, given that
the solution exists.


(6) Generalization capabilities of three-layer series-coupled systems
are poor, and in "pure generalization" experiments (where the test
stimuli have no sensory points in common with the training stimuli)
there is essentially no generalization capability.


(7) Series-coupled perceptrons with randomly organized origin-point
configurations for the A-units tend to be highly resistant to stimulus
noise and network damage; in a complex field containing mixtures of
familiar stimuli, however, they are easily confused, and are incapable
of responding selectively to one stimulus or object at a time.


(8) The addition of a fourth layer of signal transmission units, or
cross-coupling the A-units of a three-layer perceptron, permits
the solution of generalization problems, over arbitrary
transformation groups.


(9) Four-layer and cross-coupled systems with suitable rules for
modifying their connection values (Chapters 16, 17, and 19) are
capable of learning a group of transformations which have occurred
commonly in sequences of stimuli, and later recognizing the
similarity of stimuli which are equivalent under the observed
transformation group. This phenomenon occurs "spontaneously",
without any external influence on the perceptron apart from the
occurrence of stimuli.


(10) In back-coupled perceptrons, selective attention to familiar objects
in a complex field can occur. It is also possible for such a perceptron
to attend selectively to objects which move differentially relative to
their background.
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(11) By a suitable combination of geometric constraints (Chapter 23) 
a multi-layer perceptron can be enabled to recognize detailed 
patterns in high-resolution fields with markedly increased efficiency, 
compared to a randomly organized three-layer system. For a given 
universe of stimuli, there will be an optimum organization of such 
a system, which will rarely exceed three layers of A-units for 
tasks commensurate with human capabilities under tachistoscopic 


conditions. 


(12) A number of speculative models which are likely to be capable of
learning sequential programs, analysis of speech into phonemes,
and learning substantive "meanings" for nouns and verbs with
simple sensory referents have been presented in the preceding
chapters. Such systems represent the upper limits of abstract
behavior in perceptrons considered to date. They are handicapped
by a lack of a satisfactory "temporary memory", by an inability to
perceive abstract topological relations in a simple fashion, and by
an inability to isolate meaningful figural entities, or objects,
except under special conditions.


The capabilities which are outlined above, and the variety of networks 
and dynamic principles considered, map out a substantial territory, much of 
which still remains to be explored in detail. While rudimentary perceptual 
behavior appears to be present in these systems, it seems likely that to deal 
adequately with the problems of complex perceptual fields and the recognition 
of abstract relations between objects or events, additional principles must 


still be found. 


27.2 Strategy and Methodology for Future Study 


A number of the perceptrons considered in the preceding chapters have
been analyzed in a purely formal way, yielding equations which are not readily
translated into numbers. This is particularly true in the case of the four-layer
and cross-coupled systems, where the generality of the equations is reflected
in the obscurity of their implications, except for the few cases where explicit
examples have been worked out. For other models, only qualitative results are
available, although the way is clear for quantitative work to be initiated. Those
problems which appear to be foremost at this time include the following:


(1) Theoretical learning curves for the error correction procedure.
(At present, only empirical results are available, and no
attempts at theoretical analysis have proven successful.)

(2) Determination of the probability that a solution exists to a
given problem, for a perceptron drawn from a specified class.

(3) The development of optimum codes for the representation
of complex environments, in perceptrons with multiple R-units
(see Section 12.2).

(4) Development of an efficient reinforcement scheme for
preterminal connections (cf. Chapter 13).

(5) Optimum organization of stimulus analyzing mechanisms and
networks with geometrically constrained connections (cf.
Chapter 23).

(6) Terminal performance of cross-coupled and four-layer
perceptrons in generalization experiments, as a function of network
parameters, reinforcement dynamics, and environment
characteristics.

(7) Theoretical analysis of convergence-time and learning curves
for adaptive four-layer and cross-coupled perceptrons.

(8) Quantitative studies of effects of threshold servos on system
performance (cf. Chapter 21).




(9) Quantitative studies of speech recognition and phoneme analyzing 


systems. 


(10) Performance of back-coupled systems in selective attention and 


detection experiments. 


(11) Quantitative studies of sequential program learning in
back-coupled systems.


(12) Effect of spatial constraints in cross-coupled systems (e.g., 
limiting interconnections to pairs of A-units with adjacent 
retinal fields). 


(13) Studies of possible figure-segregation (figure-ground) mechanisms. 


(14) Studies of abstract concept formation, and the recognition of 


topological or metrical relations. 


(15) Biological memory mechanisms, and studies of neurophysiology 


in relation to perceptron theory. 


Four basic techniques are available for the study of these problems:
theoretical analysis, digital simulation, the construction of physical models,
and physiological experimentation. The first two problems of the above list
are specifically mathematical in character. The third, while posed as a
theoretical question, might best be investigated at the outset by means of
simulation studies. In the case of problems (4) and (5), simulation studies seem to
be indicated for preliminary exploration, although it is hoped that some
theoretical formulations may ultimately be achieved. The sixth problem -- the
determination of terminal performance of adaptive four-layer and cross-coupled
systems -- calls in effect for a variety of explicit solutions to the steady-state
equations presented in Part III. Such a program is currently being carried out
both by direct computation of the equations and by simulation techniques. For
the cross-coupled systems, simulation is likely to prove more economical in
most cases than the numerical solution of the equations. The seventh question
again is a theoretical one, although preliminary results obtained from simulation
programs should prove enlightening. The problem of threshold servomechanisms
can be investigated both by theoretical means and by simulation.


It has recently been proposed that an audio-perceptron should be
constructed at Cornell University to study the problem of speech recognition.
Since this is a problem in which the chief interest is in performance under
typical environmental conditions, rather than in theoretical problems of pattern
recognition (which have all been solved on paper, insofar as spoken inputs
resemble any other form of sensory sequences), it seems best to provide for
convenient input to a real-time system, rather than working with simulated
perceptrons and samples of digitalized speech. The problem of phoneme analysis,
however, still presents enough theoretical problems, and enough uncertainty as
to the best solution, that a digital simulation program is indicated. The system
proposed in Chapter 23 is now being investigated by this means. The problems of
back-coupled systems referred to in (10) are probably also best assigned to an
actual physical model, although a certain amount of useful simulation can be
performed in checking out the general theory before such a model is built.
Problem (11) is also of this character. Problem (12) is again of the type which
will yield most readily to simulation at this time. It is of interest in connection
with possible figure-ground mechanisms, which are included in a more general
way in Problem (13).


Problems (13) and (14) are primarily speculative in character, and
must await new insight into possible mechanisms, the exact nature of which is
not yet clear. It is hoped that studies of the other problems, which are all well
enough formulated to be investigated directly, will suggest possible approaches
to these two problems, which represent the most baffling impediments to the
advance of perceptron theory in the direction of abstract thinking and concept
formation. The previous questions are all in the nature of "mopping-up"
operations in areas where some degree of performance is known to be possible, and
where suitable mechanisms can be described, at least in qualitative terms; the
problems of figure-ground separation (or the recognition of unity) and topological
relation recognition represent new territory, against which few inroads have been
made.


The last problem -- the correlation of perceptron theory with
biological evidence -- represents at once an area of investigation in its own
right, and a potential source of insights into solutions to the prior problems.
To date, little has been done to obtain relevant physiological data directly.
Nonetheless, several hypotheses have been suggested (cf. Chapter 26), and
a great deal of useful work along the line of Hubel's studies of the cat cortex
can be carried out using known laboratory techniques.


27.3 Construction of Physical Models and Engineering Applications


From a purely scientific standpoint, physical models of particular
perceptron organizations seem to be indicated only for relatively advanced
systems (such as the speech recognition, selective attention, and program
learning perceptrons referred to above) where the theory is reasonably well
known, but the actual quantitative behavior under realistic environmental
conditions remains in doubt. In some cases, it may ultimately prove more
economical to build a physical model than to simulate a highly parallel signal
network on a sequential computer. Digital simulation, however, always has
the advantage of greater versatility and adaptability to radical changes in design
and dynamics of the simulated network. Its main difficulties are insufficient
speed, insufficient high-speed memory, and difficulty of programming the
simulation of complicated "naturalistic" environments required for some
experiments. This last disadvantage can be overcome by the design of special sensory
input devices (such as audio analyzers and flying-spot scanners) for digital
computers, and it is hoped that such equipment will be available in the near
future. While most problems can be investigated successfully in scaled-down
versions using a computer comparable to the IBM 704 or 7090, a problem
occasionally occurs which places a severe strain on the capability of even the
best digital equipment now available. The study of evolutionary models, and of
adaptation processes in cross-coupled systems, appears to be of this variety.
A special purpose digital computer (such as the Mark II design proposed by
C.A.L.) may ultimately prove to be the most expedient solution to these
problems, although the limits of useful simulation with conventional computers
have not yet been reached.


The construction of physical perceptron models of significant size
and complexity is currently limited by two technological problems: the design
of a cheap, mass-producible integrator, and the development of an inexpensive
means of wiring large networks of components. The Mark I (Frontispiece)
employs motor-driven potentiometers for integrators, and a large patch-panel
for connections -- both intolerable solutions for very large systems. The
integrator problem is currently being attacked by groups at Aeronutronic
and Stanford Research Institute, who have developed magnetic integrators which
are suitable for alpha-system perceptrons, and at Cornell University, where an
electrochemical system is under investigation. While these approaches seem to
offer some hope of an "intermediate" solution to the problem, an ultimate
solution is more likely to come from some of the solid state work and studies
of microelectronics, such as the work of Shoulders at SRI (Reference 114).
This last technique offers a potential solution to the interconnection problem,
as well as a possible means of fabricating large numbers of digital integrators
at low cost.


Since the main emphasis in this volume has been on neurodynamic
theory, rather than applications, little has been said about the engineering aspects
of the field. It is clear that if the objective of a coherent theory of brain mechanisms
is achieved, it is likely to prove applicable to pattern recognition and control
devices, as well as the development of advanced computing systems of many
varieties. Preliminary studies have been carried out dealing with possible
applications of perceptrons to photo-interpretation (Reference 116) and the
recognition of events in bubble chambers (Reference 115). More abstract
applications of the pattern recognition ability, such as the diagnosis of clinical
syndromes or meteorological prediction, have occasionally been proposed,
although little evidence has been accumulated regarding the relative suitability
of perceptrons as opposed to more conventional techniques for dealing with such
problems. The applications most likely to be realizable with the kinds of
perceptrons described in this volume include character recognition and "reading
machines", speech recognition (for distinct, clearly separated words), and
extremely limited capabilities for pictorial recognition, or the recognition of
objects against simple backgrounds. "Perception" in a broader sense may be
potentially within the grasp of the descendants of our present models, but a
great deal of fundamental knowledge must be obtained before a sufficiently
sophisticated design can be prescribed to permit a perceptron to compete with
a man under normal environmental conditions.


The most important technological development which may be inherent
in the future development of brain models would be the provision of "eyes and
ears" for conventional computers and automata, giving them a common universe
of discourse with their operators. Current attempts at heuristic problem-solving
programs (such as Newell and Simon's programs) and at automatic language
translation are hampered by a lack of common referents for symbols, which
can be no more than code-numbers for the computer, but which have a wealth
of associated meanings for the operator. The development of a system which, by
virtue of shared sensory experience, can "comprehend" the nature of the physical
referents in a descriptive statement, is probably a necessary first step to the
creation of a truly useful problem-solving computer. Linguistic capability, related
to perceptual experience, is of the essence for an "intelligent" system, artificial
or otherwise.


27.4 Concluding Remarks 


The last four years have seen the development of perceptron theory 
from the study of a few primitive models to the mapping of a comprehensive 
field of investigation. In its present form, this theory is definitive only in its 
treatment of relatively simple systems, although a considerable number of more 
advanced systems are now understood at least in a qualitative fashion, and the 


way is now open to quantitative studies of well-defined problems. 


As advanced perceptron models become more sophisticated in their
psychological properties, it becomes more appropriate to consider them as
devices capable of performing arbitrary programs of observation, response, and
manipulation of data. As this condition is reached, the methodology of perceptron
studies is likely to merge with that of the "heuristic program" approach to
psychological functioning, advocated by Newell and Simon (Reference 62). In
such programs, goal-motivated behavior becomes the main object of study,
whereas in perceptrons studied to date, the behavior is motivated primarily by
the present environment and state of the system. A merger of these approaches will
not only open up new territory, but will be a sign of the "psychological maturity"
of perceptron theory, inasmuch as it will permit the study of non-trivial
problems in the psychology of thinking and problem-solving, in terms of neurodynamic
systems of known physical structure.


On the other hand, the "biological maturity" of neurodynamic theory
must await the solution, or at least a more promising approach, to the biological
memory problem. Once this is achieved, a fruitful interaction between
perceptron theory and neurophysiology can be expected; but the memory problem remains
paramount in importance.


The theoretical approach presented in this volume is clearly a long
way from an adequate "explanation" of the foundations of human experience. The
work will have fulfilled an important purpose, however, if it has succeeded in
conveying a recognition of the potential power of a mathematical study of
neurodynamic systems, not only for understanding the physical mechanisms of the
brain itself, but for comprehending the relationship of the cognitive process in
man to the nature of the environment in which it occurs.
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APPENDIX A 
NOTATION AND STANDARD SYMBOLS 


1. Notational Conventions 


While the mathematical notation employed in this volume may still be 


capable of further improvement, several conventions have been established which 


appear to work reasonably well. They include the following: 


(1) Individual signal-units in the perceptron are referred to by a lower
case letter to indicate the type, and a subscript to designate the
particular unit in question ($a_i$ = $i$-th A-unit). Individual stimuli are
referred to by a subscripted capital ($S_i$), while stimulus sequences
are designated by script capitals ($\mathcal{S}_i$).

(2) Numbers of signal units are designated by a capital $N$, with a
subscript to indicate the type of unit in question ($N_a$ = number of
A-units). The number of stimuli is indicated by a small $n$.

(3) An asterisk is used to denote activity: $a_i^*$ = activity state (or
output signal) of the unit $a_i$; $N_a^*$ = number of active A-units;
$c_{ij}^*$ = signal transmitted by connection $c_{ij}$.

(4) Sets of units may be designated either by a subscripted capital or
by a functional notation. For example, the set of A-units responding
to stimulus $S_i$ may be designated either by $A_i$ or by $A(S_i)$.

(5) Where it is necessary to refer both to the unit receiving a signal
and to the stimulus for which the signal occurs, a tensor notation
is employed, with the signal unit indicated by a subscript and the
stimulus by a superscript. For example, $\alpha_i^j(t)$ = input signal to the
$i$-th unit from the $j$-th stimulus at time $t$. An obvious extension
would permit this notation to be applied to origins as well as
termini of signals; thus $\alpha_{ij}^{k}(t)$ would designate the signal
transmitted to unit $j$ from unit $i$ in response to stimulus $S_k$ at
time $t$.


(6) Whenever pairs of subscripts are used to designate a signal or
connection (as in $c_{ij}$) the first subscript indicates the origin,
and the second the terminus. In generalization coefficients
($g_{ij}$), the first subscript indicates the "recipient" and the
second subscript indicates the "source" stimulus.

(7) In multi-layer systems, the layers are counted separately for
each type of unit, and the number of the layer may be denoted
by a superscript in parentheses (e.g., $N_a^{(2)}$ = number of units
in the second association layer; $r_i^{(3)}$ = $i$-th R-unit of the third
R-unit layer).


Matrix and vector notations, where employed, follow usual conventions,
the particular symbols being defined in the text where they appear. The symbol
$\delta$, when it appears without subscripts, indicates a decay rate, and should not be
confused with Kronecker's delta, which appears only with subscripts ($\delta_{ij}$), or with
Dirac delta-functions, $\delta(x)$, for which the functional notation is always used.


2. Standard Symbols 


The following list includes those symbols which are used consistently 
throughout the text. A number of additional symbols are occasionally employed 
for convenience in particular expositions, and are defined where they occur. 

$u_i$ = generic symbol for the $i$-th signal-unit of a perceptron, or, in
simple perceptrons, signal to the R-unit from the $i$-th stimulus.

$s_i$ = $i$-th sensory unit

$a_i$ = $i$-th association unit

$r_i$ = $i$-th response unit

$c_{ij}$ = connection from unit $i$ to unit $j$

$s_i^*$ = output signal from $s_i$

$a_i^*$ = output signal from $a_i$

$r_i^*$ = output signal from $r_i$

[symbol illegible in source] = sequence of response states occurring as outputs
of a perceptron.

$c_{ij}^*$ = signal transmitted to unit $j$ from unit $i$, on connection $c_{ij}$
(measured at point of arrival at the terminal unit).

$\tau_{ij}$ = transmission time of connection $c_{ij}$

$v_{ij}$ = value of connection $c_{ij}$ (occasionally abbreviated to $v_i$ in simple
perceptrons, indicating the value of the connection from $a_i$ to
the R-unit).

$N_s$ = number of S-units

$N_a$ = number of A-units

$N_r$ = number of R-units

$\alpha_i$ = total input signal to the $i$-th unit. The signal due to stimulus $S_j$
is designated either by $\alpha_i(S_j)$ or by $\alpha_i^j$. If the tensor notation
is employed, then $\alpha_i$ designates the vector of signals
$(\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^n)$. Similarly, $\alpha^j$ may be used to
designate the vector $(\alpha_1^j, \alpha_2^j, \ldots, \alpha_N^j)$.

$\beta_i^j$ = component of $\alpha_i^j$ consisting of the sum of all signals originating
from the S-units.

$\gamma_i^j$ = component of $\alpha_i^j$ consisting of the sum of all signals originating
from the A-units.

(The vectors $\beta_i$, $\beta^j$, $\gamma_i$, and $\gamma^j$ are defined analogously to the
corresponding $\alpha$ vectors.)

$\phi(\alpha)$ = functional notation for activity state of a simple A-unit. $\phi = 1$
if $\alpha \geq \theta$, 0 otherwise.

$x$ = number of excitatory input-connections to an A-unit

$y$ = number of inhibitory input-connections to an A-unit

$\theta$ = threshold (specifically, $\theta_i$ = threshold of $i$-th unit)

$S_i$ = $i$-th stimulus

$\mathcal{S}_i$ = $i$-th sequence of stimuli

[symbol illegible in source] = sequence of stimuli up to, but not including, the
terminal stimulus

[symbol illegible in source] = normalized retinal area (or fraction of sensory
points) covered by $S_i$

[symbol illegible in source] = common area (retinal intersection) of stimuli
$S_i$ and $S_j$

$W$ = stimulus world, or universe

$n$ = number of stimuli in $W$
$N$ = number of admissible stimulus sequences, consisting of stimuli
in $W$

$C(W)$ = classification of stimuli in $W$, into two or more equivalence
classes.

$R(W)$ = response function, assigning possible R-unit states to each
stimulus in $W$

[symbol illegible in source] = sign of classification of stimulus $S_i$ (+1 or -1)
in a binary classification, $C(W)$

$\eta$ = increment of reinforcement per connection (typically +1 or 0,
in quantized systems)

$\delta$ = decay rate, generally applied to decaying values, but occasionally
used in connection with other quantities which are subject to
exponential decay.

$g_{ij}$ = generalization coefficient; the change in the signal to an R-unit
for stimulus $S_i$ as a result of applying a unit of positive
reinforcement ($\eta = +1$) for stimulus $S_j$

$G$ = matrix of generalization coefficients, $g_{ij}$

$Q_i$ = probability that an A-unit, in a given class of perceptrons,
responds to stimulus $S_i$

[symbol illegible in source] = probability that an A-unit of a specified layer
responds to $S_i$

$Q_{i\nu}$ = probability that an A-unit responds to the $\nu$-th stimulus in
sequence $\mathcal{S}_i$

$Q_{ij}$ = probability that an A-unit responds both to $S_i$ and to $S_j$

$Q_{i\mu,j\nu}$ = probability that an A-unit responds both to the $\mu$-th stimulus
of $\mathcal{S}_i$ and to the $\nu$-th stimulus of $\mathcal{S}_j$

(The probability of joint response for an arbitrary number of stimuli,
$Q_{ij\ldots}$, is similarly defined. When it is understood that the environment
consists of stimulus sequences, as in discussions of cross-coupled perceptrons,
the subscripts of the Q-functions are always understood to refer to stimulus
sequences, rather than individual stimuli.)


$\mu(x)$ = mean of the random variable $x$

$E(x)$ = expected value of $x$

$\sigma(x)$ = standard deviation of $x$

$P$ = probability, particularly probability of correct performance in
a given experiment.

$P(c)$ = notation commonly used for the probability that the random
variable $x$ has the value $c$; equivalent to $P(x = c)$

$T(S)$ = the transform obtained by applying transformation $T$ to
stimulus $S$

$t$ = time

[symbol illegible in source] = number of stimuli (or duration, in units $\Delta t$)
in a training sequence

$\alpha$, $\gamma$, and [symbol illegible in source] as prefixes indicate types of
reinforcement systems.

r.c.s. = reinforcement control system.
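
As an informal illustration of how several of the symbols above fit together, the following Python sketch builds a sample of simple A-units and estimates $Q_i$, $Q_j$, and $Q_{ij}$ empirically. It is only a sketch under stated assumptions: the retina size, the values of $x$, $y$, and $\theta$, the two stimuli, and all function names are hypothetical choices made for the example, and the count of jointly responding units is reported as $g_{ij}$ on the assumption that, in an $\alpha$-system with $\eta = +1$, the generalization coefficient equals the number of A-units responding to both stimuli.

```python
import random

def phi(alpha, theta):
    """Activity state of a simple A-unit: 1 if alpha >= theta, else 0."""
    return 1 if alpha >= theta else 0

def make_a_unit(num_s_units, x, y, rng):
    """Draw x excitatory (+1) and y inhibitory (-1) origin points at random."""
    origins = rng.sample(range(num_s_units), x + y)
    return {point: (1 if k < x else -1) for k, point in enumerate(origins)}

def total_input(unit, stimulus):
    """alpha: algebraic sum of signals from the S-units lit by the stimulus."""
    return sum(sign for point, sign in unit.items() if point in stimulus)

def response_statistics(units, theta, s_i, s_j):
    """Empirical Q_i, Q_j, Q_ij, and the joint count (assumed to play the role of g_ij)."""
    act_i = [phi(total_input(u, s_i), theta) for u in units]
    act_j = [phi(total_input(u, s_j), theta) for u in units]
    g_ij = sum(a * b for a, b in zip(act_i, act_j))
    n_units = len(units)
    return sum(act_i) / n_units, sum(act_j) / n_units, g_ij / n_units, g_ij

# Hypothetical example: a 100-point retina and two overlapping stimuli.
rng = random.Random(0)
units = [make_a_unit(100, x=5, y=2, rng=rng) for _ in range(500)]
s_1 = set(range(0, 40))
s_2 = set(range(20, 60))
q_1, q_2, q_12, g_12 = response_statistics(units, theta=2, s_i=s_1, s_j=s_2)
print(q_1, q_2, q_12, g_12)
```

Representing a stimulus as the set of illuminated sensory points keeps the retinal intersection of two stimuli a plain set intersection, which may be convenient when relating the Q-functions to stimulus overlap.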
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APPENDIX B 
LIST OF THEOREMS AND COROLLARIES 


This appendix contains those results which have been explicitly 


stated in the form of theorems, for convenient reference. Theorems are 


numbered by chapter and theorem number, in the order in which they 

originally appear. 

THEOREM 5.1: Given a retina with two-state (on or off) input signals,
the class of elementary perceptrons for which a solution exists to
every classification, $C(W)$, of possible environments, $W$, is
non-empty.

THEOREM 5.2: Given an elementary perceptron and a classification
$C(W)$, the following conditions are necessary but not sufficient for
a solution to $C(W)$ to exist:

i) every stimulus must activate at least one A-unit;

ii) there should be no subset of stimuli containing at least
one member of each class, such that in the union of the
responding A-unit sets, every A-unit has the same bias
ratio (with respect to the stimuli of the subset).

THEOREM 5.3: Given an elementary $\alpha$-perceptron, a stimulus world
$W$, and any classification $C(W)$; then in order for a solution to $C(W)$
to exist, it is necessary and sufficient that there exist some
vector $u$ in the same orthant as $C(W)$, and some vector $x$ such
that $Gx = u$.


COROLLARY 1: Given an elementary perceptron and a stimulus world
$W$, then if $G$ is singular, some $C(W)$ exists for which there is no
solution.

COROLLARY 2: Given an elementary perceptron, if the number of
stimuli in $W$ is $n > N_a$, there is some $C(W)$ for which no solution
exists.

COROLLARY 3: For any elementary perceptron, as the number $n$ of
stimuli in $W$ increases, the probability that a randomly selected
classification, $C(W)$, has a solution approaches zero (where $C(W)$
is chosen from a uniform distribution over the possible
classifications of $W$).
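
A small numerical case may make Corollaries 1 and 2 concrete. The Python sketch below takes a hypothetical perceptron with two A-units and three stimuli (so that $n > N_a$), enumerates the binary classifications that some assignment of A-unit values can realize, and shows that two of the eight possible classifications have no solution. The activity table, the function names, and the reading of $g_{ij}$ as the number of A-units responding to both $S_i$ and $S_j$ are assumptions made for the illustration; the enumeration over directions on the unit circle only works for the two-unit case.

```python
import math

def achievable_classifications(activity, steps=3600):
    """Sign patterns of the R-unit signal over all value vectors (two A-units only)."""
    patterns = set()
    for k in range(steps):
        # Half-step offset keeps sample directions off the decision boundaries.
        t = 2 * math.pi * (k + 0.5) / steps
        v = (math.cos(t), math.sin(t))
        u = [row[0] * v[0] + row[1] * v[1] for row in activity]
        if all(abs(signal) > 1e-9 for signal in u):
            patterns.add(tuple(1 if signal > 0 else -1 for signal in u))
    return patterns

def gram_matrix(activity):
    """Assumed form of G: g_ij = number of A-units active for both S_i and S_j."""
    return [[sum(a * b for a, b in zip(r_i, r_j)) for r_j in activity]
            for r_i in activity]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Rows are stimuli, columns are A-units (hypothetical activity states).
activity = [(1, 0), (0, 1), (1, 1)]
found = achievable_classifications(activity)
every = {(a, b, c) for a in (1, -1) for b in (1, -1) for c in (1, -1)}
missing = sorted(every - found)
print(len(found), missing, det3(gram_matrix(activity)))
```

In this example the third stimulus activates both A-units, so its signal is the sum of the other two; a classification that places it opposite to two agreeing stimuli cannot be satisfied, and the matrix $G$ built under the stated assumption is singular.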


THEOREM 5.4: Given an elementary $\alpha$-perceptron, a stimulus world
$W$, and any classification $C(W)$ for which a solution exists; let all
stimuli in $W$ occur in any sequence, provided that each stimulus
must reoccur in finite time; then beginning from an arbitrary
initial state, an error correction procedure (quantized or
non-quantized) will always yield a solution to $C(W)$ in finite time, with all
signals to the R-unit having magnitudes at least equal to an arbitrary
quantity $\delta \geq 0$.

COROLLARY: Given an elementary perceptron, a stimulus world $W$,
and any classification $C(W)$; then if a solution to $C(W)$ exists, the set
of possible solutions to $C(W)$ has positive measure over the phase
space of the perceptron.
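
The procedure to which Theorem 5.4 refers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of any program used in this work: the activity table is hypothetical, the values start from zero, the stimuli are presented in a fixed cyclic order (one admissible sequence in which each stimulus reoccurs), a zero signal is counted as an error, and the function and parameter names are invented for the example.

```python
def error_correction(activity, classes, eta=1, max_sweeps=1000):
    """Quantized error correction for an elementary alpha-perceptron (sketch).

    activity[i][k] is the activity state (0 or 1) of A-unit k for stimulus i;
    classes[i] is +1 or -1. Returns the connection values, or None if no
    solution was reached within max_sweeps passes through the stimuli.
    """
    values = [0] * len(activity[0])
    for _ in range(max_sweeps):
        errors = 0
        for states, sign in zip(activity, classes):
            signal = sum(a * v for a, v in zip(states, values))
            if sign * signal <= 0:
                # Wrong (or null) response: reinforce the active A-units only,
                # with the sign of the desired response.
                for k, active in enumerate(states):
                    if active:
                        values[k] += eta * sign
                errors += 1
        if errors == 0:
            return values
    return None

# Hypothetical three-stimulus, three-unit example for which a solution exists.
activity = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
classes = (1, -1, 1)
solution = error_correction(activity, classes)
print(solution)
```

Only the connections from A-units active for the misclassified stimulus are changed, which is what distinguishes the $\alpha$-system rule in this sketch from a rule that adjusts every connection.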


THEOREM 5.5: Given an elementary $\alpha$-perceptron with a finite number
of memory states, a random-sequence stimulus world $W$, and any
classification $C(W)$ for which a solution can be reached from the
starting point by some reinforcement sequence, then a solution
will be obtained in finite time with probability 1 by means of a
random-sign correction procedure.
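
For comparison with the deterministic rule, a random-sign correction procedure of the kind named in Theorem 5.5 might be sketched as follows. The details are assumptions made for the illustration: an error triggers a unit reinforcement of the active A-units with a sign chosen at random with equal probability, the finite number of memory states is modeled by clipping each value to a bounded integer range, and the stimulus sequence is drawn at random. All names, the bound, and the step limit are hypothetical.

```python
import random

def random_sign_correction(activity, classes, bound=3, max_steps=100_000, seed=0):
    """Random-sign correction with bounded integer values (sketch).

    Returns (values, steps) if a solution is reached, else (None, max_steps).
    """
    rng = random.Random(seed)
    values = [0] * len(activity[0])

    def signal(states):
        return sum(a * v for a, v in zip(states, values))

    def is_solution():
        return all(c * signal(row) > 0 for row, c in zip(activity, classes))

    for step in range(max_steps):
        if is_solution():
            return values, step
        i = rng.randrange(len(activity))          # random-sequence environment
        states, sign = activity[i], classes[i]
        if sign * signal(states) <= 0:
            delta = rng.choice((1, -1))           # sign chosen at random on error
            for k, active in enumerate(states):
                if active:
                    # Clipping models a finite number of memory states.
                    values[k] = max(-bound, min(bound, values[k] + delta))
    return None, max_steps

# Same hypothetical problem as a deterministic rule would solve directly.
activity = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
classes = (1, -1, 1)
values, steps = random_sign_correction(activity, classes)
print(values, steps)
```

Because corrections are made only on errors, any state that solves the classification is left unchanged, so the search stops once such a state is entered; the theorem concerns the probability that this happens in finite time, which a single seeded run can illustrate but not establish.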


THEOREM 5.6: Given an elementary $\alpha$-perceptron, a stimulus world
$W$, and some classification $C(W)$ for which a solution exists, a
solution can sometimes be achieved by an S-controlled
reinforcement procedure. However, such a solution cannot be guaranteed
for an arbitrary stimulus sequence, and may be unstable if it
occurs.

THEOREM 5.7: Given an elementary perceptron with a finite number of
memory states, a stimulus world $W$, and a classification $C(W)$ for
which a solution can be reached from the starting point by some
reinforcement sequence, then a solution can always be obtained in
finite time by means of a random perturbation correction procedure.

THEOREM 5.8: Given an elementary $\gamma$-perceptron, a stimulus world
$W$, and a classification $C(W)$, it is possible that a solution to $C(W)$
exists which cannot be achieved by the perceptron.


THEOREM 5.9: Given an $\alpha$-perceptron, and a classification $C(W)$,
a necessary and sufficient condition that the error correction
procedure reach a solution (in finite time, with arbitrary starting
point) is that there exists no non-zero vector $X^*$ (whose components
do not disagree in sign with $C(W)$) such that $b_i X^* = 0$ for all $i$
(where $b_i$ is the bias number, defined as in Chapter 5).

COROLLARY: For an $\alpha$-system, the condition that there exist no
non-zero vector $X^*$ such that $b_i X^* = 0$ for all $i$ is equivalent to the
condition that there exist $X$ and $U$ such that $GX = U$ (where $U$ is
in the same orthant as $C(W)$).

THEOREM 5.10: Given a $\gamma$-perceptron, and a classification $C(W)$, a
necessary and sufficient condition that the error correction procedure
reach a solution (in finite time) is that there exists no non-zero $X^*$
such that $b_i X^* = 0$ for all $i$.

COROLLARY: For a $\gamma$-system, the condition that there exist no
non-zero vector $X^*$ such that $b_i X^* = 0$ for all $i$ is equivalent to the
condition that there exist $X$ and $U$ such that $GX = U$ (where $U$ is
in the same orthant as $C(W)$).


THEOREM 7.1: Given a class of elementary $\alpha$-perceptrons, a finite
stimulus world $W$, a classification $C(W)$, and a training sequence;
then for every $\epsilon > 0$, there exists an $N(\epsilon)$ such that if $N_a > N(\epsilon)$,
the probability of selecting a perceptron which will correctly
identify the class of every positive stimulus will be greater than
$1 - \epsilon$.
(See Page 157 for definition of positive stimulus.)


THEOREM 9.1: In a bounded $\alpha$-perceptron, with S-controlled
reinforcement, the probability distribution $P(v)$ (for the value of a particular
connection) approaches a stable terminal distribution, where $c$ is a
normalization constant. [The explicit expressions for $P(v)$ and $c$ are
illegible in the source.]

THEOREM 10.1: Given a completely linear perceptron, a stimulus
world $W$, and a classification $C(W)$ such that the bias ratio of
every S-unit is equal (and non-zero), no solution to $C(W)$ can exist.
THEOREM 10.2: Given a simple a -perceptron with simple A-units, 


an R-unit with a continuous monotonic sign-preserving signal 
generating function, a stimulus world W (in which each stimulus 
ultimately reoccurs) and any response function @(wW) for which a 
solution exists, then by means of the error-corrective reinforce- 
ment procedure, the given response function can always be 
approximated in finite time by an output vector P(W)+é, where 
€ is a vector (€,,6,, ... »,€,). le,|<e', where € may be an 


arbitrarily small quantity greater than zero. 


LEMMA 1: Given a symmetric positive definite or positive semi- 
definite matrix, 4 , and any vector 3 , then ( 3°43) = O only 
if H 3= QO. 

LEMMA Zz: For the same conditions as Theorem 10.2, given that a 
solution exists, the set of all solutions forms a hyperplane of 


dimension equal to the nullity of G 
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COROLLARY 1: For the conditions of Theorem 10.2, and a phase space 
which is unbounded in all dimensions, the probability of convergence 
to an arbitrarily close approximation to R(w) by means of a random- 
sign correction procedure or a random -perturbation correction 


procedure may be less than 1. 


COROLLARY 2: Given the conditions of Theorem 10.2, and a phase space
bounded in all dimensions, then (given that a solution to R(W) exists
in this bounded space) the response function can always be approximated
by means of the random-sign correction procedure, the system converging
in finite time to an approximation R(W) + ε, ε a vector, where
|ε_i| < ε' for arbitrarily small ε' > 0.


COROLLARY 3: Given the same conditions as Corollary 2, the response
function can always be approximated by the random-perturbation
correction procedure, the system converging in finite time to an
approximation R(W) + ε, ε having components of magnitude |ε_i| ≤ |η|
if the reinforcement is quantized, or |ε_i| < ε', ε' > 0, if η is chosen
from a continuous distribution around zero.


THEOREM 10.3: Given a simple perceptron with a simple R-unit, and
with transmission functions for all A-R connections of the form
f(α_i) v_ir, where f is any function, and given the existence of a
solution to a classification function C(W) for this perceptron, then
if p(v) is any polynomial of odd degree in v, there also exists a
solution if the transmission function is changed to f(α_i) p(v_ir).


THEOREM 10.4: Given the perceptron of Theorem 10.3, if a solution
exists for some transmission function f(α_i) v_ir, a solution does not
necessarily exist for the transmission function g(α_i) v_ir, g ≠ f.


THEOREM 10.5: Given a simple perceptron with A-R connections which
differ in their transmission functions, or with uniform transmission
functions but non-simple A-units, a response function R(W) may
have a solution which is unattainable by either the error correction
procedure or the random-sign correction procedure.


THEOREM 10.6: Given a simple perceptron with any mixture of
transmission functions f_i(α_i, v_ir) for the connections c_ir, and a response
function R(W) for which a solution exists; then there exists some
transmission function g(α, v) which is uniform for all connections,
such that a solution to R(W) exists.


THEOREM 10.7: Given a simple perceptron with an R-unit which is either
simple or has a continuous signal generating function, and with any
combination of transmission functions from its A-units (all continuous
functions of v_ir, equal to zero if α_i = 0), and given a bounded
phase space within which a solution exists for R(W); then, if each
stimulus in W ultimately reoccurs, an approximate solution R(W) + ε
is always obtainable in finite time by the random-perturbation
correction procedure.


THEOREM 12.1: Given a perceptron with more than one R-unit, and a
response function R(W) or a classification C(W) for which a solution
exists, it may be impossible to achieve this solution by an error
correction procedure which applies negative reinforcement jointly
to all R-units based on errors in their joint response.


THEOREM 13.1: Given a three-layer series-coupled perceptron with
simple A and R-units and variable S-A connections, and a
classification C(W) for which a solution exists, it may be impossible to
achieve a solution by any deterministic correction procedure which
obeys the local information rule.


THEOREM 13.2: Given a three-layer series-coupled perceptron, with
simple A and R-units, variable-valued S-A connections, bounded
A-R values, and a classification C(W) for which a solution exists,
then a solution to C(W) can be obtained in finite time with
probability 1 by means of a back-propagating error-correction
procedure, given that each stimulus in W always reoccurs in
finite time, and that probabilities P_1, P_2, and P_3 are all greater
than 0 and less than 1.


(See Section 13.3 for definition of the back-propagating correction
procedure.)


APPENDIX C
BASIC EQUATIONS


The following equations are those most likely to be referred to
repeatedly, and are listed here in a somewhat different order from their
appearance in the text.


(1) Generalization Coefficients

For an α-system,

    g_{ij} = n_{ij}
    E g_{ij} = Q_{ij}    (normalized form)

For a γ-system,

    g_{ij} = n_{ij} - (1/N_a) n_i n_j
    E g_{ij} = Q_{ij} - Q_i Q_j    (normalized form)
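
The two forms above can be illustrated with a small sketch. It assumes (this reading is not stated in the appendix itself) that n_i is the number of A-units responding to stimulus S_i, that n_ij is the number responding to both S_i and S_j, and that N_a is the total number of A-units. The function name and the response-table layout are hypothetical.

```python
def generalization_coefficients(responses):
    """Return (g_alpha, g_gamma) from a table of A-unit responses.

    responses[a][i] is 1 if A-unit a responds to stimulus S_i, else 0
    (an assumed encoding, chosen only for this illustration).
    """
    n_units = len(responses)          # N_a
    n_stimuli = len(responses[0])

    # n_i: count of A-units responding to S_i
    n_single = [sum(row[i] for row in responses) for i in range(n_stimuli)]

    # n_ij: count of A-units responding to both S_i and S_j
    n_joint = [
        [sum(row[i] * row[j] for row in responses) for j in range(n_stimuli)]
        for i in range(n_stimuli)
    ]

    # alpha-system: g_ij = n_ij
    g_alpha = [row[:] for row in n_joint]

    # gamma-system: g_ij = n_ij - (1/N_a) n_i n_j
    g_gamma = [
        [n_joint[i][j] - n_single[i] * n_single[j] / n_units
         for j in range(n_stimuli)]
        for i in range(n_stimuli)
    ]
    return g_alpha, g_gamma
```

With four A-units whose responses to two stimuli are (1,1), (1,0), (0,1), (0,0), the α-form gives off-diagonal coefficients of 1, while the γ-form gives 0, reflecting the subtraction of the chance-level overlap.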


(2) R-unit Input Signals 


For an α- or γ-system,

    u = Gx


where u is the vector of R-unit input signals, and z,; =/.f; 


(f_i being the number of times S_i has been reinforced).


(3) Q-Functions 


For individual stimuli, in a simple perceptron,

    Q_i = \sum_{E=\theta}^{E_{\max}} \sum_{I=0}^{E-\theta} P_x(E) \, P_y(I)

where

    E_{\max} = x for the binomial model, \infty for the Poisson model

    P_x(E) = probability that E excitatory connections to an
             A-unit originate from active S-points (see
             Equations 6.2 and 6.3)

    P_y(I) = probability that I inhibitory connections to an
             A-unit originate from active S-points (see
             Equations 6.2 and 6.3)
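
As a rough numerical illustration of the double sum for a single stimulus under the binomial model, the sketch below assumes that each of the x excitatory and y inhibitory connections independently originates from an active S-point with probability r. That binomial form, and the names `q_single`, `x`, `y`, `theta`, and `r`, are assumptions made here for illustration; the appendix defers the actual probability functions to Equations 6.2 and 6.3.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def q_single(x, y, theta, r):
    """Probability that an A-unit is active, i.e. that E - I >= theta.

    x, y  : numbers of excitatory and inhibitory connections (assumed fixed)
    theta : A-unit threshold
    r     : assumed probability that a connection starts at an active S-point
    """
    total = 0.0
    # E runs from theta up to E_max = x (binomial model)
    for e in range(max(theta, 0), x + 1):
        # I runs from 0 up to E - theta, and cannot exceed y
        for i in range(0, min(y, e - theta) + 1):
            total += binomial_pmf(e, x, r) * binomial_pmf(i, y, r)
    return total
```

For example, `q_single(2, 1, 1, 0.5)` sums the cases (E=1, I=0), (E=2, I=0), and (E=2, I=1).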

    Q_{ij} = \sum P_x(E_i, E_j, E_c) \, P_y(I_i, I_j, I_c)

summed over all values satisfying

    E_i + E_c - I_i - I_c \geq \theta,    E_j + E_c - I_j - I_c \geq \theta

where P_x and P_y are defined by Equations 6.6 and 6.7, for binomial and
Poisson models.


For series-coupled perceptrons with distributed transmission
times, see Sections 11.1 and 11.2 for prototype equations.


For multi-layer series-coupled systems, Q-functions for the 


a layer can be computed by the approximation described in Section 15.1. 


For similarity-constrained four-layer perceptrons, Q_{ij}^{(2)} for
two random or unrelated stimuli is given by:


ai? = [¥--0%)"] 


where m is the number of A^{(1)} units connected to each A^{(2)} unit.


For a stimulus S;, and its transform 5S,’ , in a similarity- 
constrained model, 

a = OP ar: 
where Qi? as 1-(1-Q/")” and Qi) can be approximated by Equations 15.5 


and 15. 8 for the case of random stimulus patterns in a finite retina. In an 
fa) 


ee : ‘ (2) : : 
infinite retina, with random stimuli, @;;; =Q;;. For coherent stimuli and 


assuming 7 to be a topological transformation, 
: (1) 
(2) m-1 m-! CO. ee Q;; (0) 
Qe: y= ee. + (1- 2 Boxe (1- Si”) 
where w is the order of the transformation group, and Qin.e is given 
by Equation 15.6. A particular solution for the case of square stimuli can 


be found in Equation 15.15. 


For cross-coupled perceptrons with fixed connections, Q_i and
Q_{ij} are given by Equations 18.1 and 18.2, respectively.


For adaptive four-layer and cross-coupled systems, the terminal 
values of the Q-functions are obtained as a product of the iterative procedures 


described in Chapters 16, 17, and 19, and take the form: 
Qi = >. P(A,) (4 + 74 (co) (Ay + 7 j (co) 
Ps 


(4) Equations for Learning Performance


For an error correction procedure, an upper bound on the
number of corrections that will be required to achieve a solution from zero
initial conditions is given by

    N \leq nM/\alpha

where n is the number of stimuli in W, M is the maximum diagonal
element of G, and α is the minimum of the function x'Hx/||x||^2 as defined
for Theorem 4, Chapter 5. For a more general bound, see Equation 7.12.
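
The bound can be compared against an actual run of an error correction procedure. The sketch below is an illustration under stated assumptions, not the text's own algorithm: it works directly with the relation u = Gx from section (2), treats x_i as the signed count of corrections applied on stimulus S_i (unit reinforcement), and takes the desired class signs `rho` (+1 or -1) and the value of α as inputs supplied by the caller. All function and parameter names are hypothetical.

```python
def correction_bound(n_stimuli, max_diagonal, alpha):
    """Upper bound N <= n*M/alpha on the number of corrections."""
    return n_stimuli * max_diagonal / alpha


def run_error_correction(G, rho, max_passes=1000):
    """Cycle through the stimuli, correcting only on errors.

    G   : symmetric matrix of generalization coefficients g_ij
    rho : desired sign (+1 or -1) of the R-unit input for each stimulus
    Returns (x, corrections), starting from zero initial conditions.
    """
    n = len(G)
    x = [0.0] * n
    corrections = 0
    for _ in range(max_passes):
        changed = False
        for i in range(n):
            # R-unit input signal for stimulus S_i: u_i = sum_j g_ij x_j
            u_i = sum(G[i][j] * x[j] for j in range(n))
            if rho[i] * u_i <= 0:      # wrong (or zero) response: correct it
                x[i] += rho[i]
                corrections += 1
                changed = True
        if not changed:                # a full pass with no errors
            return x, corrections
    raise RuntimeError("no solution found within max_passes")
```

For G = [[2, 1], [1, 2]] with opposite desired signs, the run needs two corrections; with n = 2, M = 2, and a caller-supplied α of 1, the bound evaluates to 4.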


For an S-controlled learning procedure, in an elementary 
perceptron, a bound on the error probability for a "positive stimulus" ‘S_ 


is given by 
An improved estimate of the probability of correct response, employing a
normal distribution assumption, is given by Equation 7.7.


For fixed training sequences, 
T Na 4; PQ: for an c& -system 
E(uyz) = 2 
TN, 24 P-(Q;,-Q2;Q,) fora Zor 7-system 
2 2 
Oo (uz) = T Ng ~ i Pu Pal (Qi 4, ~ Qi, Qgx) 
J 
for an -system, and 


= *(uz) ad “Na Pz 2 Py Pa Oe (ce ‘ Q; Fr @x) : 2Q, (Qjy 7 ., @,) 
J 


’ ( sx Q; Qx) (Qg,- Qe Qx)] 
fora 7 -system. The equation for a true J -system is given in Equation 8. 7. 


For random training sequences, E(u_x) is as above, and the variances
are given by Equation 7.11 for an α-system, and Equation 8.14 for a γ-system.


(5) Steady-State Equations for Four-Layer and Cross-Coupled Systems

For an adaptive four-layer α-perceptron (Chapter 16), the
terminal values of the signals transmitted by the variable-valued
connections are given by iterating the equation:


x! = Na” Yc. blair) 
(vet) o ry, “ey “(y) 


a n 
where 7,,, = O and Cis Fai Fe; (f, . being the frequency of the 


sequence § 45;): This equation will converge in at most 7 steps to the 


terminal value of ° . Equations for 7 and / -systems are presented 


in Chapter 16. 


For an open-loop cross-coupled system, the above iteration 


equation applies without modification. 


For a closed-loop cross-coupled α-perceptron, the iteration
equation becomes


r NN ¢ ¢ ee ae ee 
Piven = “GEE 0A ty) E008) O68 rion 0a 07)| 
J 


which is specific to the read A-unit, or to the set of A-units having the 
,3 -vector 7:;. The solutions for 7 and / ~-systems are discussed in 


Chapter 19. 


APPENDIX D
STANDARD DIAGNOSTIC EXPERIMENTS


A number of experiments have been described in the course of
the text which are employed for comparison and evaluation of different
perceptron models. Those experiments which are referred to by number are listed
here for convenience in cross-referencing figures and discussions in the text.


EXPERIMENT 1: Horizontal/vertical bar discrimination, in a 20 by 20
toroidally connected retina, with 4 by 20 bars. Stimuli occur in
fixed sequence. S-controlled reinforcement is employed.
(see Page 162)


EXPERIMENT 2: Same environment and procedure as Experiment 1, but 
with alternating positions in opposite classes. (see Page 164) 


EXPERIMENT 3: Same as Experiment 1, but with stimuli occurring in 
random sequence. (see Page 170) 


EXPERIMENT 4: Same as Experiment 3, but horizontal bars occur four 
times as frequently as vertical bars. (see Page 170) 


EXPERIMENT 5: Same as Experiment 1, but with error-correction
reinforcement. (see Page 173)


EXPERIMENT 6: Same as Experiment 5, but with stimuli occurring in
random sequence. (see Page 173)


EXPERIMENT 7: Triangle/square discrimination experiment, with error-
correction procedure, in a 20 by 20 retina. Random sequence, with
stimuli occurring in all translational positions with equal probability.
(see Page 173)


EXPERIMENT 8: Horizontal/vertical bar discrimination, with random 
sequences, and random-sign correction procedure. (see Page 176) 


EXPERIMENT 9: Horizontal and vertical bars in random sequence, with
R-controlled reinforcement. (see Page 214)


EXPERIMENT 10: "Spontaneous organization" experiment, with an
environment of n stimuli, such that all pairs have equal intersections. The
stimuli are divided into two classes, and the perceptron is exposed to
a preconditioning sequence in which the transition probability between
members of the same class is large, and the transition probability
between classes is small. At the end of the preconditioning sequence,
R-controlled reinforcement is applied for a brief period. (see
Page 365)


EXPERIMENT 11: "Transformation learning" experiment, in which the
perceptron is exposed to an alternating preconditioning sequence of stimuli and
their transforms. After the preconditioning period, the perceptron
is taught to discriminate two test stimuli, which were not previously
seen, and is then tested on their transforms. (see Page 375)


EXPERIMENT 12: The preconditioning sequence consists of a repetitive 
sequence of four stimuli, with spatial relationships favoring the 
dichotomy (5,,59) vs (5,,9,), while temporal association favors (S,,5,) 
ve (S5,5,). The Q-matrix is evaluated at the end of the preconditioning 
period. (see Page 393) 


EXPERIMENT 13: "Sequence prediction" experiment. The preconditioning 
procedure uses a finite sequence environment with the same stimuli as 
in Experiment 12, but the perceptron is tested (in addition) with the 
stimulus §, followed by a sequence of null stimuli, and the Q-matrix 
for all subsequences is obtained. (see Page 445) 


EXPERIMENT 14: Preconditioning procedure with same stimuli as in 
Experiment 12, but with each stimulus repeated two times whenever 
it occurs. The terminal Q-matrix for all subsequences is determined. 
(see Page 450) 


EXPERIMENT 15: Selective attention experiment, for a four R-unit
perceptron trained to discriminate shapes and retinal positions of stimuli,
and then tested with complex stimuli combining two shapes and two
positions simultaneously. (see Page 478)


EXPERIMENT 16: Selective attention in an audio-visual perceptron, 
trained to discriminate shapes and positions as in Experiment 15, but 
biased by the addition of an auditory name for the shape or position 
of part of the stimulus pattern. (see Page 452) 
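

The stimulus environment of Experiment 1 (and the experiments that reuse it) can be sketched as follows. The encoding is an assumption made for illustration: the retina is a 20 by 20 grid of 0/1 values, a bar occupies 4 adjacent rows (horizontal) or columns (vertical) across the full width, and "toroidally connected" is read as wrap-around of the bar position at the retina edge. The function names are hypothetical.

```python
def bar_stimulus(orientation, position, size=20, width=4):
    """Return a size-by-size 0/1 grid containing one full-length bar.

    orientation : "horizontal" or "vertical"
    position    : index of the first row (or column) of the bar
    """
    retina = [[0] * size for _ in range(size)]
    for offset in range(width):
        k = (position + offset) % size   # wrap around the toroidal retina
        for t in range(size):
            if orientation == "horizontal":
                retina[k][t] = 1         # fill row k
            else:
                retina[t][k] = 1         # fill column k
    return retina


def experiment_1_stimuli(size=20, width=4):
    """All horizontal and vertical bars, one per starting position."""
    return [
        (orientation, position, bar_stimulus(orientation, position, size, width))
        for orientation in ("horizontal", "vertical")
        for position in range(size)
    ]
```

Under these assumptions there are 40 stimuli of 80 active points each, and any horizontal bar intersects any vertical bar in 16 points.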